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Abstract. Consider a dataset of vector-valued observations that consists of a modest number of 
noisy inliers, which are explained well by a low-dimensional subspace, along with a large number of 
outliers, which have no linear structure. This work describes a convex optimization problem, called 
REAPER, that can reliably fit a low-dimensional model to this type of data. The paper provides 
^ an efficient algorithm for solving the reaper problem, and it documents numerical experiments 

^— ^ which confirm that reaper can dependably find linear structure in synthetic and natural data. In 

addition, when the inliers are contained in a low-dimensional subspace, there is a rigorous theory 
^ that describes when reaper can recover the subspace exactly. 

OO 

^ 1. Introduction 

I— I Low-dimensional linear models have applications in a huge array of data analysis problems. Let 

^ us highlight some examples from computer vision, machine learning, and bioinformatics. 

Illumination models: Images of a face — or any Lambertian object — viewed under different illu- 
, mination conditions are contained in a nine- dimensional subspace jEHY95[ lHYL+n3[ lBJn3] . 

^ Structure from motion: Feature points on a moving rigid body lie on an affine space of dimen- 

j>. sion three, assuming the affine camera model |CK98] . More generally, estimating structure 

from motion involves estimating low-rank matrices lEvdHlOl Sec. 5.21. 

Latent semantic indexing: We can describe a large corpus of documents that concern a small 
number of topics using a low-dimensional linear model DDL"*" 88] . 



Population stratification: Low-dimensional models of single nucleotide polymorphism (SNP) 
^ data have been used to show that the genotype of an individual is correlated with her 

T— I geographical ancestry |NJB+08] . More g enerally, l inear models are used to assess differences 

in allele frequencies among populations PPP"'"06] . 



^ In most of these applications, the datasets are noisy, and they contain a substantial number of 

5^ outliers. Principal component analysis, the standard method for finding a low-dimensional linear 

^ model, is sensitive to these non-idealities. As a consequence, good robust modeling techniques 

would be welcome in a range of scientific and engineering disciplines. 

The purpose of this work is to introduce a new method for fitting low-dimensional linear models. 
The approach is robust against noise in the inliers, and it can cope with a huge number of outliers. 
Our formulation involves a convex optimization problem, and we describe an efficient numerical 
algorithm that is guaranteed to produce an approximate solution after a modest number of spectral 
calculations. We include experiments with synthetic and natural data to verify that our technique 
reliably seeks out linear structure. Furthermore, under the ideal assumption that the inliers are 
contained in a subspace, we develop sufficient conditions for the method to identify the subspace 
without error. 
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1.1. Notation and Preliminaries. We write ||-|| for the H.2 norm on vectors and the spectral 
norm on matrices; ||-||p represents the Frobenius norm. Angle brackets (•, •) denote the standard 
inner product on vectors and matrices, and tr refers to the trace. The orthogonal complement 
of a subspace L is expressed as L-*-. The curly inequality ^ denotes the semidefinite order: For 
symmetric matrices A and B, we write A ^ B \i and only if S — A is positive semidefinite. 

An orthoprojector is a symmetric matrix 11 that satisfies 11^ = 11. Each subspace L in is the 
range of a unique Dx D orthoprojector II^,. The trace of an orthoprojector equals the dimension of 
its range: tr(ni) = dim(L). For each point x G M'^, the image TIlx is the best £2 approximation 
of X in the subspace L. 

Finally, we introduce the spherization transform for vectors: 

£:=|"/"""' (1.1) 
[0, otherwise. 

We extend the spherization transform to matrices by applying it separately to each column. 

1.2. Linear Modeling by Principal Component Analysis. Let be a set of N distinct 
observations in M.^ . Suppose we wish to determine a d-dimensional subspace that best explains the 
data. For each point, we can measure the residual error in the approximation by computing the 
orthogonal distance from the point to the subspace. The classical method for fitting a subspace 
asks us to minimize the sum of the squared residuals: 

minimize ^ ||a; — na;||^ subject to H is an orthoprojector and tvll = d. (1-2) 

This approach is equivalent with the method of principal component analysis (PCA) from the 
statistics literature |Jol02| and the total least squares (TLS) method from the linear algebra com- 
munity |vHV87j . 



The mathematical program (1.2) is not convex because orthoprojectors do not form a convex 
set, so we have no right to expect that the problem is tractable. Nevertheless, we can compute an 
analytic solution using a singular value decomposition (SVD) of the data |EY391 lvHV87j . Suppose 
that X is a D X N matrix whose columns are the data points, arranged in fixed order, and let 
X = LfSV^ be an SVD of this matrix. Define by extracting the first d columns of U; the 
columns of Ua are often called the principal components of the data. Then we can construct an 



optimal point H^, for (1.2) using the formula = UdU^ 



1.3. Classical Methods for Achieving Robustness. Imagine now that the dataset X contains 
inliers, points we hope to explain with a linear model, as well as outliers, points that come from 
another process, such as a different population or noise. The data are not labeled, so it may be 



challenging to distinguish inliers from outliers. If we apply the PCA formulation (1.2) to fit a 
subspace to X, the rogue points can interfere with the linear model for the inliers. 

To guard the subspace estimation procedure against outliers, statisticians have proposed to 



replace the sum of squares in (1.2) with a figure of merit that is less sensitive to outliers. One 
possibility is to sum the unsquared residuals, which reduces the contribution from large residuals 
that may result from aberrant data points. This idea leads to the following optimization problem. 

minimize ^ ||a; — na;|| subject to 11 is an orthoprojector and 1^11 = d. (1-3) 



In case d = D — 1, the problem ( |1.3[ ) is sometimes called orthogonal i\ regression |SW87j or 
least orthogonal absolute deviations |Ny q88| . The extension to general d is apparently more re- 
cent IWatnUlDZHZnH] . See the books [ HRM IRL87L IMMYOBj for an extensive discussion of other 
ways to combine residuals to obtain robust estimators. 



Unfortunately, the mathematical program (1.3) is not convex, and, in contrast to (1.2), no deus 



ex machina emerges to make the problem tractable. Although there are many algorithms |Nyq88[ 
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Algorithm 1.1 Prototype algorithm for robust computation of a linear model 
Input: 

• A set X of observations in 

• The target dimension d for the linear model, where < d < D 
Output: 

• A [d] -dimensional subspace of 
Procedure: 



1 [Opt.] Solve (1.5) to obtain a center c^,, and center the data: a; — for each x G X 



2 [Opt.] Spherize the data: x xj for each nonzero x ^ X 



3 Solve (1.4) with dataset X and parameter d to obtain an optimal point 



4 Output a dominant [d] -dimensional invariant subspace of P^, 



ICM911 IBH931 IWat02l IDZHZ061 IZSLOQj that attempt (jTs]), none is guaranteed to return a global 
minimum. In fact, most of the classical proposals for robust linear modeling involve intractable 
optimization problems, which makes them poor options for computation in spite of their theoretical 
properties |MMY06| . 

1.4. A Convex Program for Robust Linear Modeling. The goal of this paper is to develop, 
analyze, and test a rigorous method for fitting robust linear models by means of convex optimization. 



We propose to relax the hard optimization problem (1.3) by replacing the nonconvex constraint set 



with a larger convex set. The advantage of this approach is that we can solve the resulting convex 
program completely using a variety of efficient algorithms. 

The idea behind our relaxation is straightforward. Each eigenvalue of an orthoprojector 11 equals 
zero or one because 11^ = 11. Although a 0-1 constraint on eigenvalues is hard to enforce, the 
symmetric matrices whose eigenvalues lie in the interval [0, 1] form a convex set. This observation 
leads us to frame the following convex optimization problem. Given a dataset X in and a target 
dimension d E (0, -D) for the linear model, we solve 



minimize 



El 



X 



Px\ subject to O^P^I and \.rP = d. 



(1.4) 



We refer to (1.4) as reaper because it harvests linear structure from data. The formulation (1.4) 



refines a proposal from the paper [ZLll], which describes a looser relaxation of (1.3). 



In many cases, the optimal set of (1.4) consists of a single orthoprojector whose range provides 



an excellent fit for the data. The bulk of this paper provides theoretical and empirical support for 
this observation. In Section |4| we present a numerical algorithm for solving (1.4). 



1.4.1. Some Practical Matters. Although the reaper formulation is effective on its own, we can 



usually obtain better linear models if we preprocess the data before solving (1.4). Let us summarize 
the recommended procedure, which appears as Algorithm |1.1[ 

First, the reaper problem assumes that the inliers are approximately centered. When they are 
not, it is important to identify a centering point for the dataset and to work with the centered 
observations. We can compute a centering point c^, robustly by solving the Euclidean median 
problem ^HROOl [MMY06] : 

- |. (1.5) 



minimize 



X 



It is also possible to incorporate centering by modifying the optimization problem (1.4); see Sec- 
tion [6lL5] for more information. 

Second, the reaper formulation can be sensitive to outliers with large magnitude. A simple 
but powerful method for addressing this challenge is to spherize the data points before solving the 
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optimization problem. For future reference, we write down the resulting convex program. 

minimize \\x — Px\\ subject to O^P^l and trP = (i. (1-6) 



The tilde denotes the spherization transform (1.1). We refer to (1.6) as the s-reaper problem. In 



most (but not all) of our experimental work, we have found that s-reaper outperforms reaper. 
The idea of spherizing data before fitting a subspace was proposed in the paper [L MS^99j . where it 
is called spherical PC A. Spherical PCA is regarded as one of the most effective current techniques 
for robust linear modeling |MMY06] . 

Finally, we regard the parameter d in reaper and s-REAPER as a proxy for the dimension of 
the linear model. While the rank of an optimal solution 
than d because of the constraints trP 



to (1.4) or (1.6) cannot be smaller 
d and P ^ I, the rank of P* often exceeds d. Our 
numerical experience indicates that the dominant eigenvectors of Pi, do not change much as we 
adjust d. Therefore, if a model with exact dimension d is required, we recommend using a dominant 



d-dimensional invariant subspace of P^,. See Section 6.1.6 for alternative approaches. 
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Figure 1.1. A needle in a haystack. The needle dataset is described in Section 1.5 



[Top] The projection onto a random two-dimensional subspace. [Bottom] The pro- 
jection onto the oracle two-dimensional subspace obtained by applying PCA to the 
inliers only. The blue squares mark the projection of the inliers onto the one- 
dimensional model computed with s-reaper (1.6). 
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1.5. Finding a Needle in a Haystack. The reaper and s-reaper formulations work aston- 
ishingly well. They can reliably identify linear structure in situations where the task might seem 
impossible. As a first illustration, we examine the problem of "finding a needle in a haystack." In 
Section [5| we present a more comprehensive suite of experiments. 

Let the ambient dimension D = 100. Consider a dataset X in MP that consists of 6 points arrayed 
arbitrarily on a line through the origin (the needle) and 200 points drawn from a centered normal 
distribution with covariance D~^\ (the haystack). The 6 inliers have strong linear structure, but 
they comprise only 3% of the data. The 200 outliers, making up the bulk of the sample, typically 
have no linear structure at all. We solve the optimization problem s-reaper with the parameter 
d = 1 to obtain an optimal point -P^, which happens to be a rank-one orthoprojector. 

Figure |1.1| displays the outcome of this experiment. We see that the range of the s-reaper 
model Pi, identifies the direction of the needle with negligible error] Of course, this feat is not 
magic. Even though the needle is impossible to see in most two-dimensional projections of the 
data, it becomes visible once we identify the direction of the needle. We find it remarkable that 
S-reaper renders the search for this direction into a convex problem. 
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Figure 1.2. A syringe in a haystack. The syringe dataset is described in Sec- 
tion 1.5 [Top] The projection onto a random two-dimensional subspace. [Bottoin\ 
The projection onto the oracle two-dimensional model determined by applying PCA 
to the inliers only. The blue squares mark the projection of the inliers onto the one- 
dimensional model computed with s-reaper (1.6). 
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Our method for fitting linear models is robust to noise in the inliers, although it may require more 
data to fit the model accurately. To illustrate, we consider a noisy version of the same problem. 
We retain the ambient dimension D = 100. Each of 10 inliers takes the form Xi = giV + Zi 
where v is a fixed vector, gi are i.i.d. standard normal variables, and the noise vectors Zj are 
i.i.d. NORMAl(0, {16D)~^ T). The inliers form a tube around the vector v, so they look like a 
syringe. As before, the outliers consist of 200 points that are i.i.d. normal(0, D^^T). We apply 
S-REAPER with d = 1 to the full dataset to obtain a solution P^, and we use a dominant eigenvector 
U-^ of as our one-dimensional model. To find the optimal two-dimensional model for the inliers. 



we solve ( 1.2 ) with d = 2 using the inliers only to obtain two oracle principal components. Figure 1.2 
illustrates the outcome of this experiment. The angle between the first oracle principal component 
and the s-REAPER model u^, is approximately 3.60°. 



1.6. The Haystack Model. Inspired by the success of these experiments, let us attempt to place 
them on a firm theoretical foundation. To that end, we propose a very simple generative random 
model. We want to capture the idea that the inliers admit a low-dimensional linear model, while 
the outliers are totally unstructured. 

Table 1.1. The Haystack Model. A generative ran- 
dom model for data with linear structure that is con- 
taminated with outliers. 



D 


Dimension of the ambient space 


L 


A proper d-dimensional subspace of containing the inliers 


iVin 


Number of inliers 


iVout 


Number of outliers 


Pin 


Inlier sampling ratio pin := N\^Jd 


Pout 


Outlier sampling ratio pout := ^out/D 




Variance of the inliers 




Variance of the outliers 




Set of A'^in inliers, drawn i.i.d. normal(0, {crf^/d) Hl) 




Set of A^out outliers, drawn i.i.d. normal(0, {cTout/D) ^d) 




The set Xi^i U Xo^^t containing all the data points 



There are a few useful intuitions associated with this model. As the inlier sampling ratio pin 
increases, the inliers fill out the subspace L more completely so the linear structure becomes more 
evident. As the outlier sampling ratio pout increases, the outliers become more distracting and they 
may even start to exhibit some linear structure due to chance. Next, observe that we have scaled 
the points so that their energy is independent of the dimensional parameters: 

E ||a;||^ = (Tjn for x G Afjn and E ||a;||^ = a'^^^ for x G A^out- 

As a result, when af^ = Couf cannot screen outliers just by looking at their energy. The sampling 
ratios and the variances contain most of the information about the behavior of this model. 

1.7. Exact Recovery under the Haystack Model. We have been able to establish conditions 
under which reaper and s-reaper provably find the low-dimensional subspace L in the Haystack 
Model with high probability. A sufficient condition for exact recovery is that pin, the number of 
inliers per subspace dimension, should increase linearly with pout, the number of outliers per ambient 
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dimension. As a consequence, we can find a low-dimensional linear structure in a high-dimensional 
space given a tiny number of examples, even when the number of outliers seems exorbitant. 

Theorem 1.1 (Exact Recovery for the Haystack Model). Fix a number /3 > 0, and assume that 
^ ^ d < {D — l)/2. Let L be an arbitrary d-dimensional subspace ofK^, and draw the dataset X 
at random according to the Haystack Model on page 

(i) Suppose that the sampling ratios and the variances satisfy the relation 



Pm>Ci + C2/3 + C3 



Cout 



<7i, 



(/Oout + 1 + 4/3) . 



(1.7) 



Then YIl is the unique solution to the REAPER problem (1.4) except with probability 4e . 
(ii) Suppose that the sampling ratios satisfy the relation 

Pin >Ci + C2 /3 + C3 Pout. (1.8) 

ThenYli is the unique solution to the S-REAPER problem (1.6) except with probability A e~^'^ . 
The numbers Cj and Cj are positive universal constants. 

Theorem 11.1 



is a consequence of a deterministic exact recovery condition that we discuss more 
fully in Section 3| We obtain Theorem 1.1 when we compute the values of some summary statistics, 
assuming that the data is drawn from the Haystack Model. See Appendix [B] for the details of this 
argument, including tighter conditions that hold for the full parameter range 1 < d < D — 1. 



For clarity, we have suppressed the values of the constants in (1.7) and (1.8). Our proof yields 



reasonable numerical estimates. For example, Ci < 13 and C2 < 7 and C3 < 16. Experiments 
indicate that C3, the constant of proportionality between and pout, is less than two. 



1.7.1. A Needle in a Haystack. To complement the experiments in Section 1.5, we include a quan- 
titative result on the behavior of s-reaper for finding a needle in a haystack. 

Fact 1.2. Assume the ambient dimension D > 50, and fix a one-dimensional subspace L. Suppose 
the number of inliers A'in > 13 and the number of outliers A'^out = 21). Draw a dataset X according 



to the Haystack Model on page[6| Then Ul is the unique solution to the s-reaper problem (1.6) 
at least 99% of the time. 



Fact 1 1.2| follows from Theorem B.l We highlight this ad hoc result to underscore the point that 



it takes only a few good observations for (1.6) to identify inliers that fall on a line. Indeed, Fact 1.2 
is valid even in dimension D = 10'^^ with Aont = 2 x 10 



^3^ 



1.7.2. Is it Hard to Find a Needle in a Haystack? Our low-dimensional intuition suggests that it 
might be difficult to identify the subspace in the Haystack Model, but there are several algorithms 
that can complete the task. For instance, the EM algorithm for normal mixtures has no trouble 
with this example [Reel 2] . In contrast, for natural data, we have found that the s-reaper method 
often outperforms other methods by a reasonable margin. 

Let us emphasize that we do not intend for the Haystack Model to describe natural data. We 
have introduced this formulation as a way to probe the limitations of the reaper and s-reaper 
problems in experiments. The Haystack Model also gives us some theoretical insight on their exact 
recovery behavior. We believe that Theorem |1.1| is compelling in part because the optimization 
problems are derived without knowledge of the model. It would be valuable to analyze reaper 
and S-REAPER with more sophisticated data models, but we leave this task for future work. 
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1.8. Summary of Contributions. This work partakes in a larger research vision: Given a difficult 
nonconvex optimization problem, it is often more effective to solve a convex variant than to accept 
a local minimizer of the original problem. 

We believe that the main point of interest in this paper is our application of convex optimization 
to solve a problem involving subspaces. There are two key observations here. We parameterize 
subspaces by orthoprojectors, and then we replace the set of orthoprojectors with its convex hull 
{P : ^ -P ^ I}. Neither of these ideas is new, but this research provides clear evidence that they 
can be used to address important geometric questions in data analysis. 

To support the claim that our approach works, we develop a deterministic analysis that describes 
when REAPER and S-REAPER can recover a subspace exactly. This analysis yields summary statistics 
that predict when the optimization problems succeed. For data that conform to the Haystack 
Model on page[6j we can evaluate these summary statistics to obtain the exact recovery guarantees 
of Theorem 11.11 See Section |3] for more information. 

In Section|4j we present an efficient iterative algorithm for solving the semidefinite program (1.4). 



The algorithm typically requires a modest number of spectral computations, so it scales to relatively 
large problem instances. 

In Section [5j we use our algorithm to perform some experiments that describe the behavior of 
REAPER and S-REAPER. First, we study synthetic data problems to understand the limitations of 



S-REAPER; these experiments indicate that Theorem 1.1 is qualitatively correct. We also describe 
some stylized applications involving natural data; these results show that s- reaper dominates other 
robust PCA techniques in several situations. Furthermore, the linear models we obtain appear to 
generalize better than the robust models obtained with some other methods. 

We describe a variety of future research directions in Section [6| In particular, we propose some 
alternative formulations of the reaper problem. We also suggest some ways to extend our approach 
to manifold learning and hybrid linear modeling. 

The technical details of our work appear in the appendices. 

2. Previous Work 

Robust linear modeling has been an active subject of research for over three decades. Although 
many classical approaches have strong robustness properties in theory, the proposals usually involve 
intractable computational problems. More recently, researchers have developed several techniques, 
based on convex optimization, that are computationally efficient and admit some theoretical guar- 
antees. In this section, we summarize classical and contemporary work on robust linear modeling, 
with a focus on the numerical aspects. We recommend the books [HROQ^ iMMY06^ IRL87j for a 
comprehensive discussion of robust statistics. 

2.1. Classical Strategies for Robust Linear Modeling. We begin with an overview of the 
major techniques that have been proposed in the statistics literature. The theoretical contributions 
in this area focus on breakdown points and influence functions of estimators. Researchers tend to 
devote less attention to the computational challenges inherent in these formulations. 

2.1.1. Robust Combination of Residuals. Historically, one of the earliest approaches to linear re- 
gression is to minimize the sum of (nonorthogonal) residuals. This is the principle of least absolute 
deviations (LAD). Early proponents of this idea include Galileo, Boscovich, Laplace, and Edge- 
worth. See |Har74al IHar74b| IDodBTj for some historical discussion. It appears that orthogonal 
regression with LAD was first considered in the late 1980s |()W85[ ISW87I |Nyq88| ; the extension 



from orthogonal regression to PCA seems to be even more recent |Wat01t IDZHZ06] . LAD has also 
been considered as a method for hybrid linear modeling in [ZSL091 ILZllj . We are not aware of 
tractable algorithm for these formulations. 

There are many other robust methods for combining residuals aside from LAD. An approach 
that has received wide attention is to minimize the median of the squared residuals |Rou84| IRL87] . 
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Other methods appear in the books [HR091 IMMY06] . These formulations are generally not com- 
putationally tractable. 

2.1.2. Robust Estimation of Covariance Matrix. Another standard technique for robust PCA is to 
form a robust estimate of the covariance matrix of the data |Mar761 fHROQl IMMY061 IDGK8H lDav87| 
IXY95t ICHOOj . The classical robust estimator of covariance is based on the maximum likelihood 
principle, but there are many other approaches, including S-estimators, the minimum covariance 
determinant (MOD), the minimum volume ellipsoid (MVE), and the Stahel-Donoho estimator. 
See [MMY061 Sec. 6] for a review. We are not aware of any scalable algorithm for implementing 
these methods that offers a guarantee of correctness. 

2.1.3. Projection Pursuit PCA. Projection pursuit (often abbreviated PP-PCA) is a procedure 
that constructs principal components one at a time by finding a direction that maximizes a ro- 
bust measure of scale, removing the component of the data in this direction, and repeating the 
process. The initial proposal appears in |HR09l 1st ed.], and it has been explored by many other 
authors |L(X^[ IAmin93l K^FOOTl IKwanSl IWTHn9j . We are aware of only one formulation that 
provably (approximately) maximizes the robust measure of scale at each iteration |MTllj . but 
there are no overall guarantees for this algorithm. 

2.1.4. Screening for Outliers. Another common approach is to remove possible outliers and then 
estimate the underlying subspace by PCA |CW821 fTBOH lTB03j . The classical methods offer very 
limited guarantees. There are some recent algorithms that are provably correct |Bru09t IXC M 10) 
under model assumptions. 

2.1.5. RANSAC. The randomized sample consensus (RANSAC) method is a randomized iterative 
procedure for fitting models to noisy data consisting of inliers and outliers |FB81| . Under some 
assumptions, RANSAC will eventually identify a linear model for the inliers, but there are no 
guarantees on the number of iterations required. 

2.1.6. Spherical PCA. A useful method for fitting a robust linear model is to center the data 
robustly, project it onto a sphere, and then apply standard PCA. This approach is due to [LMS+99] . 
Maronna et al. jMMYOG] recommend it as a preferred method for robust PCA. The technique is 
computationally practical, but it has limited theoretical guarantees. 

2.2. Approaches Based on Convex Optimization. Recently, researchers have started to de- 
velop effective techniques for robust linear modeling that are based on convex optimization. These 
formulations invite a variety of tractable algorithms, and they have theoretical guarantees under 
appropriate model assumptions. 

2.2.1. Deconvolution Methods. One class of techniques for robust linear modeling is based on split- 
ting a data matrix into a low-rank model plus a corruption. The first approach of this form is 
due to Chandrasekaran et al. [CSPWlT] . Given an observed matrix X, they propose to solve the 
semidefinite problem 

minimize ||P||5.^ + 7 ||C||^^ subject to X = P + C. (2.1) 

Minimizing the Schatten 1-norm IHI^^ promotes low rank, while minimizing the vector £1 norm 
promotes sparsity. The regularization parameter 7 negotiates a tradeoff between the two goals. 



Candes et al. |CLMWTT] study the performance of (2.1) for robust linear modeling in the setting 
where individual entries of the matrix X are subject to error. 

A related proposal is due to Xu et al. [XCSlOal IXC SI Ob] and independently to McCoy and 
Tropp |MTllj . These authors recommend solving the decomposition problem 

minimize H-Plls^ + 7 ||C||*^2 subject to X = P + C. (2.2) 
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The norm 



ll-s>2 



returns the sum of Euchdean norms of the columns of its argument. This formu- 



lation is appropriate for inlier-outlier data models, where entire columns of the data matrix may 
be corrupted. 



Both (2.1) and (2.2) offer some theoretical guarantees under appropriate model assumptions. 



The most common algorithmic framework for these problems depends on the alternating direction 
method of multipliers (ADMM), which is a type of augmented Lagrangian scheme. These algorithms 
converge slowly, so it takes excessive computation to obtain a high-accuracy solution. 



2.2.2. Precedents for the reaper Problem. The reaper problem (1.4) is a semidefinite relaxation 
of the ii orthogonal distance problem (1.3). Our work refines a weaker relaxation of (1.3) proposed 
by Zhang and Lerman: 



minimize 



El 



X 



Px\\ subject to trP = l. 



Some of the ideas in the theoretical analysis of reaper are drawn from |LZ10l IZLllj . The IRLS 
algorithm for (1.4) and the convergence analysis that we present in Section [4] are also based on the 
earlier work. 

The idea of relaxing a difficult nonconvex program to obtain a convex problem is well established 
in the literature on combinatorial optimization. Research on linear programming relaxations is 
summarized in [Vaz03] . Some significant works on semidefinite relaxation include |LS9H IGW95j . 



3. Deterministic Conditions for Exact Recovery of a Subspace 

In this section, we develop a deterministic model for a dataset that consists of inliers located 
in a fixed subspace and outliers that may appear anywhere in the ambient space. We introduce 
summary statistics that allow us to study when REAPER or S- REAPER can recover exactly the 
subspace containing the inliers. The result for the Haystack Model, Theorem |1.1[ follows when we 
estimate the values for these summary statistics. 

3.1. A Deterministic Model for Studying Exact Recovery of a Subspace. We begin 
with a deterministic model for a dataset that consists of low-dimensional inliers mixed with high- 
dimensional outliers. 



Table 3.1. The In k Out Model. A deterministic 
model for data with linear structure that is contami- 
nated with outliers. 



D 


Dimension of the ambient space 


L 


A proper d-dimensional subspace of 


iVin 


Number of inliers 


Nont 


Number of outliers 


■^in 


Set of A'in inliers, at arbitrary locations in L 




Set of A'out outliers, at arbitrary locations in \ L 


X 


Set Afin U A'out containing all the observations 



The key point about this model is that the inliers are all contained within the subspace L, so it 
is reasonable to investigate when we can identify L exactly. For notational convenience, we assume 
that no observation is repeated, but this assumption plays no role in our analysis. The Haystack 
Model on page [6] almost surely produces a dataset that conforms to the In & Out Model. 
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3.2. Summary Parameters for the In & Out Model. In general, we cannot hope to identify 
the subspace L that appears in the In & Out Model without further assumptions on the data. 
This section describes some summary statistics that allow us to check when reaper or s-reaper 
succeeds in computing L. These quantities play a role analogous to the coherence statistic in the 
sparsity literature |DH01j . 

First, let us discuss what property of the inliers might allow us to determine the subspace L that 
contains them. Roughly, we need the inlying observations to permeate L. To quantify this idea, 
we define the permeance statistic (with respect to L) by the formula 

V{L):= inf (3-1) 

\\u\\=l S'GA'in 

Similarly, we define the spherical permeance statistic as 

V{L):= inf K"' (3-2) 

||u|j=l a^eA'in 

When V{L) = (equivalently, V{L) = 0), there is a vector in the subspace L that is orthogonal to 
all the inliers. In this case, the data do not contain enough information for us to recover L. On the 
other hand, when one of the permeance statistics is large, the inliers corroborate each direction in 
the subspace. In this case, we expect our methods for linear modeling to be more effective. 

To ensure recovery of L, we must also be certain that the outliers do not exhibit linear structure. 
Otherwise, there is no way to decide whether a subspace captures linear structure from the inliers 
or the outliers. Indeed, in the extreme case where the outliers are arranged on a line, any sensible 
estimation procedure would have to select a subspace that contains this line. 

Let us introduce a quantity that measures the linear dependency of the outliers within a given 
subspace. The linear structure statistic is defined for each subspace M C by the formula 

5(M)2:= sup Y \{u, x)\^ = \\UMX,^t\\\ (3.3) 

||„||=1 ^fc-^^out 

where Xout is a I? x A^out matrix whose columns are the outliers, arranged in fixed order. Similarly, 
we can define the spherical linear structure statistic 

5(M)2:= sup Y \iu,U^)^ = \\U!^^xZtf, (3.4) 
||it||=i ^e-^^out 

where IlAfXout is a x A'^out matrix whose columns are the vectors 11^/3^ for each x £ Afout- The 
linear structure statistics tend to be large when the outliers are nearly collinear (after projection 
onto M), while the statistics tend to be small when the outliers are close to orthogonal. 

3.3. Conditions for Exact Recovery of a Subspace. We are now prepared to state conditions 
when either REAPER or S-REAPER recovers the subspace L containing the inliers exactly. 

Theorem 3.1. Let L be a d- dimensional subspace of MP , and assume that X is a dataset that 



follows the In & Out Model on page 10 
(i) Suppose that 



V{L) > V2d-S{L^)-S{R^). (3.5) 



Then Hi is the unique minimizer of the REAPER problem (1.4). 
(ii) Suppose that 



V{L)> V2d-S{L^)-S{R^). (3.6) 



Then Hi is the unique minimizer of the S-REAPER problem (1.6) 
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The permeance statistics V and V are defined in (3.1 ) and (3.2 ), while the linear structure statistics 



S and S are defined in (3.3) and (3.4). 



Theorem 3.1 shows that the reaper (resp. s-reaper) problem can find the subspace L con- 



taining the inhers whenever the (spherical) permeance statistic is sufficiently large as compared 



with the (spherical) linear structure statistic. Be aware that the inequalities (3.5) and (3.6) are 



not comparable; neither one implies the other. The main difference is due to the fact that the 



condition (3.6) for s-REAPER is invariant to the scalings of the data points. 



The proof of Theorem 3.1 appears in Appendix A.l The main idea in the argument is to verify 
that, under the appropriate sufficient condition, the objective function of reaper or s-reaper 
increases if we perturb the decision variable P away from Hl- 



3.4. Exact Recovery for the Haystack Model. The conditions in Theorem 3.1 hold in highly 
nontrivial — even surprising — regimes. In this section, we make some heuristic calculations to indi- 
cate how the spherical summary statistics behave when X is drawn from the Haystack Model on 
page[6j We prove rigorous results, including Theorem |1.1[ in Appendix [B| 

First, let us consider the spherical permeance statistic V{L). Fix a unit vector u € L, and 
suppose that a; is a random unit vector in L. The random variable {u, x) is approximately normal 
with mean zero and variance d~^, so we have E \ {u, x)\ 



^2/{TTd). It follows that 



V{L) 



E 



(u, X) 



Vd' 



We have omitted the infimum over u from (3.1) because it does not make a substantial difference. 



See Lemma B.3 for a rigorous computation. 

Next, we examine the spherical linear structure statistic S{M) = ||njv/-^out|| • This quantity is 
just the norm of a matrix with A'out columns that are i.i.d. uniform on the unit sphere in M. For 
many purposes, this type of matrix behaves like a dim(M) x A'^out Gaussian matrix whose entries 
have variance dim(Af)~^. Standard bounds for the norm of a Gaussian matrix |DSOH Thm. 2.13] 
give the estimate 

/ A^out 



S{M) 

Therefore, we can estimate the product 



dim(M) 



+ 1. 



5(1 



Aout 

D 



+ 1 



A^, 



out 



D-d 



+ 1 



< 



out 



D-d 



+ 1 



See Lemma IB. 51 for more careful calculations. 

To conclude, when the data X is drawn from the Haystack Model, the sufficient condition (3.6) 
for S-REAPER to succeed becomes 

-I 2 

A^out 



Ai„ 
d 



> 



D-d 



+ 1 



(3.7) 



Assume that D-d is comparable with D. Then the heuristic (3.7) indicates that the inlier sampling 
ratio Pin = Ni^/d should increase linearly with the the outlier sampling ratio pout = Nout/D. This 
scaling matches the rigorous bound in Theorem |1.1[ 

4. An Iterative Reweighted Least-Squares Algorithm for reaper 



In this section, we present a numerical algorithm for solving the REAPER problem (1.4). Of 



course, we can also solve s-reaper by spherizing the data before applying this algorithm. Our 
approach is based on the iterative reweighted least squares (IRLS) framework |Bjo96[ Sec. 4.5.2]. 
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At each step, we solve a least-squares problem with an adapted set of scaling parameters. The 
subproblem has a closed-form solution that we can obtain by computing an eigenvalue (or singular 
value) decomposition. Empirically, the IRLS approach exhibits linear convergence, so it can achieve 
high accuracy without an exorbitant number of iterations. It is possible to apply other types of 
optimization algorithms, such as interior-point methods, to solve reaper, but we are not aware of 
an alternative approach that scales to large datasets. 



4.1. A Weighted Least-Squares Problem. The reaper problem (1.4) cannot be solved in 



closed form, but there is a closely related problem whose minimizer can be computed efficiently. 
We use this mathematical program and its solution to build an algorithm for solving reaper. 
For each x E X , let /3x he a nonnegative weight parameter. Consider the weighted least-squares 



problem with the same constraints as (1.4): 



minimize 



E 



X 



Px\\ subject to O^P^I and ti P = d. 



(4.1) 



Algorithm 4.1 Solving the weighted least-squares problem (4.1) 



Input: 

• A set X of observations in 

• A nonnegative weight f3x for each x £ X 



• The parameter d in (4.1), where < d < D 
Output: 



• A D X D matrix that solves (4.1) 
Procedure: 

1 Form the D x D weighted covariance matrix 



XX 



2 Compute an eigenvalue decomposition C = U ■ diag(Ai, . . . , Ad) • with eigenvalues in 
nonincr easing order: Ai>--->Aj:)>0 

3 if Ai^rfj+i = then 

a for i = 1, . . . , D do 

fl, i<[d\ 
i^i^ <d-[d\, i=[d\+l 
lo, otherwise 

4 else 

a for i = [d\ + 1, . . . , D do 

i Set 

I]fe=l 

ii if Aj > ^ > Aj+1 then break for 
b for i = 1, . . . , D do 

jl-i, h>9 

\o, Xi<e 

5 return P^, := U • diag(z^i, . . . , vd) • 
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Algorithm 4.1 describes a computational procedure for solving (4.1). The basic idea is to compute 



a weighted covariance matrix C from the data. We use spectral shrinkage to scale the eigenvalues 
of C to the range [0, 1]. The amount of shrinkage is determined by a water-filling calculation, which 



ensures that the output matrix the correct trace. In Appendix C.l , we present a more mathematical 



formulation of Algorithm 4.1 and we use this statement to establish correctness. 



4.1.1. Discussion and Computational Costs. The bulk of the computation in Algorithm 4.1 takes 
place during the spectral calculation in Steps 1 and 2. We need 0{ND'^) arithmetic operations to 
form the weighted covariance matrix, and the spectral calculation requires 0{D^). The remaining 
steps of the algorithm have order 0{D). 

We have presented Algorithm |4.1| in the most direct way possible. In practice, it is usually 
more efficient to form a D x N matrix W with columns y/~^x for x £ X, to compute a thin SVD 
W = IfSV^, and to set A = S-^. This approach is also more stable. In some situations, it i possible 
to accelerate the spectral calculations using randomized dimension reduction, as in |HMT11] . 

4.2. Iterative Reweighted Least-Squares Algorithm. We are prepared to develop an algo- 



objective function of (4.1), we see that 



rithm for solving the REAPER problem (1.4). Let us begin with the intuition. Suppose that solves 
the weighted least-squares problem (4.1) where f3x ~ — i-'j,a;||~ for each x G X. Examining the 



X 



P^x\ 



El 

xex 



X 



P^x\ 



Therefore, it seems plausible that is also close to the minimizer of the reaper problem (1.4) 
This heuristic is classical, and it remains valid in our setting. 



Algorithm 4.2 IRLS algorithm for solving the REAPER problem (1.4) 



Input: 

• A set X of observations in 

• The dimension parameter d in (1.4), where < d < D 

• A regularization parameter 6 > 

• A stopping tolerance e > 
Output: 

• A D X D matrix that satisfies =^ P* =^ I and trP^ 

Procedure: 

1 Initialize the variables: 

a Set the iteration counter A; 
b Set the initial error a^^^ < — hoo 
c Set the weight /3a; 1 for each x £ X 

2 do 

a Increment k ^ k + 1 



b Use Algorithm 4.1 to compute an optimal point pC^) of (4.1) with weights /3a 
c Let a^^) be the optimal value of ( |4.1| ) at P*^^) 
d Update the weights: 

(3x < rr II — „m in ^^^^ x e X 

maxjd, ||ic — P>.'^^a;|| I 

until the objective fails to decrease: a^^^ > a^''^^^ — e 

Return P^ = p(^) 
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This observation motivates an iterative procedure for solving (1.4). Let 5 be a positive regular- 
ization parameter. Set the iteration counter k 0, and set the weight /3a; 1 for each x £ X. We 
solve (4.1 ) with the weights /3a, to obtain a matrix i-'(^), and then we update the weights according 



to the formula 



Pa 



1 



for each x £ X. 



max {5, ||a; — P('^)ic|| } 

In other words, we emphasize the observations that are explained well by the current model. The 
presence of the regularization 5 ensures that no single point can gain undue influence. We increment 
k, and we repeat the process until it has converged. 

The following result shows that 



We summarize the computational procedure in Algorithm 4.2 



Algorithm 4.2 is guaranteed to converge to a point whose value is close to the optimal value of the 



REAPER problem (1.4). 



Theorem 4.1 (Convergence of IRLS). Assume that the set X of ob serv ations does not lie in the 

union of two strict subspaces o/M^. Then the iterates of Algorithm 4-2 with e = converge to a 



point Ps that satisfies the constraints of the REAPER problem (1.4). Moreover, the objective value 
at Ps satisfies the bound 



E 



X 



Psx\ 



X 



P^xW < -6\X\ 



where P-^ is an optimal point of REAPER. 



The proof of Theorem 4.1 is similar to established convergence arguments |ZL1H Thms. 6 and 8], 
which follow the schema in |CM991 IVESOj . See Appendix [C] for a summary of the proof. It can 
also be shown that the point Ps is close to the optimal point i-^ in Frobenius norm. 



4.2.1. Discussion and Computational Costs. The bulk of the computational cost in Algorithm 4.2 
is due the solution of the subproblem in Step 2b. Therefore, each iteration costs 0{ND'^). The 
algorithm converges linearly in practice, so we need 0(log(l/r/)) iterations to achieve an error of r}. 



Figure 4.1 shows that, empirically. Algorithm |4.2| exhibits linear convergence. In this experiment, 
we have drawn the data from the Haystack Model on page [6] with ambient dimension D = 100 and 
A^out = 200 outliers. The curves mark the performance for several choices of the model dimension d 
and the inlier sampling ratio p\a = Ni^/d. For this plot, we run Algorithm 4.2 with the regularization 
parameter 6 = W~^^ and the error tolerance e = 10~^^ to obtain a sequence 

{p(fc)} of iterates. We 

compare the computed iterates with an optimal point P^, of the reaper problem ( |1.4[ ) obtained by 
solving REAPER with the Matlab package CVX [GBOSi IGBlOj at the highest-precision setting. The 
error is measured in Frobenius norm. 



For synthetic data, the number of iterations required for Algorithm 4.2 seems to depend on 
the difficulty of the problem instance. Indeed, it may take as many as 200 iterations for the 
method to converge on challenging examples. In experiments with natural data, we usually obtain 
good performance after 20 iterations or so. In practice. Algorithm |4.2| is also much faster than 



algorithms [ XCSlObl IMTllj for solving the low-leverage decomposition problem (2.2). 



The stopping criterion in Algorithm 4.2 is motivated by the fact that the objective value must 
decrease in each iteration. This result is a consequence of the proof of Theorem 4.1 see (C.7). 



Taking e on the order of machine precision ensures that the algorithm terminates when the iterates 
are dominated by numerical errors. In practice, we achieve very precise results when e = 10~^^ or 
even e = 0. In many applications, this degree of care is excessive, and we can obtain a reasonable 
solution for much larger values of e. 

Theorem |4. 1 1 requires a technical condition, namely that the observations do not lie in the union 
of two strict subspaces. This condition cannot hold unless we have at least 2D — 1 observations. 
In practice, we find that this restriction is unnecessary for the algorithm to converge. It seems to 
be an artifact of the proof technique. 
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Convergence of IRLS, d = 1 




Convergence of IRLS, d = 10 



50 100 150 200 
Iteration number k 




50 100 150 200 
Iteration number k 



Figure 4.1. Convergence of IRLS to an optimal point. The data are drawn from 
the Haystack Model on page [6] with ambient dimension D = 100 and A'out = 200 
outUers. Each curve is associated with a particular choice of model dimension d 
and inlier sampling ratio = N\^/d. We use Algorithm 4.2 to compute a se- 



quence of iterates, which we compare to an optimal point i-^ of the reaper 

problem (1.4). See the text in Section 4.2 for more details of the experiment. 



5. Experiments 

In this section, we present some numerical experiments involving reaper and s-reaper. To 
probe the limits of these optimization problems for fitting a linear model in the presence of outliers, 
we study how they behave on some simple random data models. To gauge their performance in 
more realistic settings, we consider some stylized problems involving natural data. 



5.1. Experimental Setup. To solv e the reaper problem (1.4) and the s-reaper problem (1.6), 
we use the IRLS method. Algorithm 4.2 We set the regularization parameter 5 = 10^^^ and the 



stopping tolerance e = 10~^^. 

In all the cases considered here, the dimension parameter d is a positive integer. Unless we state 
otherwise, we postprocess the computed optimal point P^, of reaper or s-reaper to obtain a 
d-dimensional linear model. When has rank greater than d, we construct a d- dimensional linear 
model by using the span of the d dominant eigenvectors of P^,. Our goal is to use a consistent 
methodology that involves no parameter tweaking. Other truncation strategies are also possible, 
as we discuss in Section [6.1.61 

5.2. Comparisons. By now, there are a huge number of proposals for robust linear modeling, so 
we have limited our attention to methods that are computationally tractable. That is, we consider 
only formulations that have a polynomial-time algorithm for constructing a global solution (up 
to some tolerance). We do not discuss techniques that involve Monte Carlo simulation, nonlinear 
programming, etc. because the success of these approaches depends largely on parameter settings 
and providence. As a consequence, it is hard to evaluate their behavior in a consistent way. 
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We consider two standard approaches, PCA |Jol02) and spherical PCA |LMS"'"99] . Spherical PCA 
rescales each observation so it lies on the Euclidean sphere, and then it applies standard PCA. In 
a recent book, Maronna et al. |MMY06] recommend spherical PCA as the most reliable classical 
algorithm, which provides a second justification for omitting many other established methods. 

We also consider a more recent proposal [ XCSlOal IXCSlObl IMTllj . which is called low-leverage 
decomposition (LLD) or outlier pursuit. This method decomposes the D x N matrix X of obser- 
vations by solving the optimization problem 



minimize 



+ 7 l|C| 



l-s>2 



subject to X = P + C (5.1) 

where \\-\\s^ is the Schatten 1-norm and ||-||*_5.2 returns the sum of Euclidean norms of the column. 
The idea is that the optimizer (i-'^,C*) will consist of a low-rank model i-*^ for the data along 
with a column-sparse matrix that identifies the outliers. We always use the parameter choice 
7 = 0.8^J D/N , which seems to be effective in practice. 

We do not make comparisons with the rank-sparsity decomposition [CSP WlT] . which has also 
been considered for robust linear modeling in [CLMWlT] . It is not effective for the problems that 
we consider here. 

5.3. Phase Transitions for Exact Recovery of Subspaces. This section studies the limits of 
the S- REAPER problem by applying it to data drawn from the Haystack Model on page [6} The goal 
is to understand when S-REAPER can recover a low-dimensional subspace exactly, even when the 
dataset contains many unstructured outliers. Theorem 1.1 tells us that the inlier sampling ratio 



Needle in D=3 dimensions 




^2, 4 6, 8. 
Outlier oversampling ratio p 



Out 



Needle in D=100 dimensions 
14 




Outlier oversamp^ng ratfo p 



Out 



Figure 5.1. Exact recovery of a needle in a haystack via s-reaper. The data is 
drawn from the Haystack Model on page [6] with subspace dimension d = 1. The 
ambient dimension D = 3 [left] and D = 100 [right]. The blue curves show the 



theoretical 50% success threshold given by ( |B.lc ). The red circles mark the empirical 
50% success threshold for each value of pin, and the yellow lines indicate the least- 
squares fit. The trends are = 1.41/?out + 0.59 [left] and = 1.53/9out + 2.68 



[right]. See Section 5.3 for details. 
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Pin = N\a/d and the outlier sampling ratio pout = ^ont/D are relevant to exact recovery. Indeed, 
we expect to see a linear relationship between these two parameters. 

In these experiments, we fix the ambient dimension D and the dimension d of the subspace L. 
The number A'in of inliers and the number A'out of outliers vary. Note that the specific choice of the 
subspace L is immaterial because the model is rotationally invariant. The variance parameters can 
be equated {af^ = a'^^^ = 1) because they do not play any role in the performance of s-reaper. 
We want to determine when the solution P^, to the s-reaper problem equals the projector 11^ 
onto the model subspace, so we declare the experiment a success when the spectral-norm error 
\\Pi, — HlW < 10~^. For each pair {pm,Pout), we repeat the experiment 25 times and calculate an 
empirical success probability. 



Figure 5.1 charts the performance of s-reaper for recovering a one-dimensional subspace in 
ambient dimension D = 3 and D = 100. For each value of pi^, we identify the minimum empirical 
value of Pout where the success probability is 50%, and we use least-squares to fit a linear trend 
(with Pin the independent variable). This experiment shows that the linear relationship between 
Pin and Pout is valid for the needle problem. The theoretical bound in Theorem |B.l overestimates 
the empirical trend by approximately a factor of two. 

We repeat the same experiment for larger values of the subspace dimension d. Figure |5.2| shows 
the results when {d,D) equals (2,3) or (25, 100). Once again, we see a linear relationship between 
Pin and Pout- Our theoretical bound in these cases is less satisfactory; Theorem B.l overestimates 
the trend by approximately a factor of eight. 



Sheet in a haystack: d=2, D=3 



Subspace in a haystack: d=25, D=100 




OufTler oversampTing ratio p 



Out 




Gutter oversamp^ng ratK) p 



Out 



Figure 5.2. Exact recovery of a subspace in a haystack via s-reaper. The data 
is drawn from the Haystack Model on page [ejwith {d,D) = (2,3) [left] and {d, D) = 
(25, 100) [rig'iit]. The red circles mark the empirical 50% success threshold for each 
value of Pin, and the yellow lines indicate the least-squares fit. The trends are 
Pin = 1.52pout + 0.70 [left] and pir 
details. 



1.66pout + 3.14 [rigiit]. See Section 5.3 for 
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Mean angle from S-Reaper dominant component 
to true subspace (d=1 , D=100) 




Figure 5.3. Mean angle between the needle and the computed model. The data is 
drawn from the Haystack Model on page [6]with {d,D) = (1,100). The image shows 
the empirical mean, over 25 trials, of the angle between the true subspace and a 



dominant eigenvector of a solution to the s-reaper problem ( |1.6| ). See Section 5.3.1 
for details. 



5.3.1. Graceful Degradation after Recovery Threshold. The phase transition plots in Figure 5.1 
and |5.2| are misleading because s-reaper continues to work well, even when it does not recover 
the true subspace exactly. A different view of the data from the phase transition experiments in 



Section [5.3| captures how our procedure breaks down. Figure 5.3 shows the empirical mean angle 
between the true subspace L and the dominant eigenvector of the solution to the s-reaper 
problem. The angle increases gradually as the outlier sampling ratio increases. We have also found 
that the solution of s-reaper degrades gracefully when the dimension d of the subspace is larger. 

5.4. Faces in a Crowd. The next experiment is designed to test how well several robust methods 
are able to fit a linear model to face images that are dispersed in a collection of random images. 
This setup allows us to study how well the robust model generalizes to faces we have not seen. 

We draw 64 images of a single face under different illuminations from the Extended Yale Face 
Database |LHK05j . We use the first 32 faces for the sample, and we reserve the other 32 for the out- 
of-sample test. Next, we add 400 additional random images from the BACKGROUND/Google folder of 
the CaltechlOl database |Cal061 [FFFP04] . Each image is converted to grayscale and downsampled 



to 20 X 20 pixels. We center the images by subtracting the Euclidean median (1.5). Then we 
apply PCA, spherical PCA, LLD, reaper, and s-REAPER to fit a nine-dimensional subspace to the 
da ta. See |BJ03j for justification of the choice d = 9. This experiment is similar to work reported 
in |LLY+10| Sec". VI]. 

Figure [F!4| displays several images from the sample projected onto the computed nine-dimensional 
subspace (with the centering added back after projection). For every method, the projection of 
an in-sample face image onto the subspace is recognizable as a face. Meanwhile, the out-of-sample 
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Figure 5.4. Face images projected onto nine-dimensional linear model. The 
dataset consists of 32 images of a single face under different illuminations and 400 
random images from the CaltechlOl database. The original images [left column] 
are projected onto the nine-dimensional subspaces computed using five different 
modeling techniques. The first two rows indicate how well the models explain the 
in-sample faces versus the random images. The last two rows show projections of 
two out-of-sample faces, which were not used to compute the linear models. See 



Section 5.4 for details. 



faces are described poorly by the PCA subspace. All of the robust subspaces capture the facial 
features better, with s-reaper producing the clearest images. 



Figure 5.5 shows the ordered distances of the 32 out-of-sample faces to the robust linear model 
as a function of the ordered distances to the model computed with PCA. A point below the 1:1 line 
means that the ith closest point is closer to the robust model than the ith closest point is to the 
PCA model. Under this metric, s-reaper is the dominant method, which explains the qualitative 
behavior seen in Figure 5.4 This plot clearly demonstrates that s-reaper computes a subspace 



that generalizes better than the subspaces obtained with the other robust methods. Indeed, LLD 
performs only slightly better than PCA. 
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Figure 5.5. Approximation of out-of-sample face images by several linear models. 
The ordered distances of the out-of-sample face images to each robust model as a 
function of the ordered distances to the PCA model. The model was computed 
from 32 in-sample images; this graph shows how the model generalizes to the 32 



out-of-sample images. Lower is better. See Section 5.4 for details 



5.5. Low-Dimensional Modeling of Textures. Our next experiment involves an image of a 
structured texture (a pegboard |SIP121 File 1.5.02]) that has been occluded with an unstructured 
texture (sand [SIP 121, File 1.5.04]). Shifts of a structured texture can be explained well with a low- 
dimensional model, while unstructured textures look like (anisotropic) noise. This type of model 
might be interesting for inpainting applications. 



See Figure 5.6 for the images we use. The occlusion covers over 80% of the original image, and it 
is scaled so that the original image is faint in comparison with the occlusion. We block the occluded 
image into 24 x 24 patches, and we vectorize the resulting 441 patches to obtain observations in 



a 576-dimensional space. We center the data about the Euclidean median (1.5) of the patches. 
We compute a three-dimensional linear model for the patches in the occluded image using PCA, 
spherical PCA, LLD, reaper, and s-reaper. 

Figure |5.7| shows the reconstruction of some patches from the original, unoccluded image. It 
appears that S-REAPER and REAPER reproduce the structured texture better than PCA or spherical 



PCA. Figure 5.8 shows a plot of the ordered distances of patches in the original image to the robust 
linear model versus the ordered distances to the model computed by PCA. Note that the models 
are all computed from the occluded image. 

The plot confirms the qualitative behavior seen in Figure |5.7[ the regression planes determined 
by REAPER and S-REAPER capture the energy of the patches in the original image better than 
PCA and spherical PCA. Here, the advantage of spherization coupled with reaper is clear: the 
performance of S-REAPER is significantly better than either spherical PCA or reaper. 
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Figure 5.6. Original image and occluded image. The original [left] and occluded 
[right] images used in this experiment. The yellow gridlines demarcate the bound- 
aries of the 24 X 24 patches. 
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Figure 5.7. Reconstruction of some patches from the original texture. The left 
column shows three patches from the original texture that are completely hidden in 
the occluded image. The next four columns show the projections of these patches 
onto the three-dimensional subspace computed from the occluded image. We omit 
the reconstruction obtained with LLD because it is substantially the same as the 



result with PCA. See Section 5.5 for details. 
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Figure 5.8. Approximation of original image patches by several linear models. 
We use four robust methods to determine a three-dimensional hnear model for the 
patches in the occluded image, Figure |5.6[ The plot shows the ordered distances of 



the patches in the original image to each robust models as a function of the ordered 
distances to the PCA model. Lower is better. See Section 15.51 for details. 



6. Extensions and Modifications 

There are many opportunities for research on robust linear models. For example, it would also 
be valuable to extend the analysis in this work to more sophisticated data models and to consider 
more realistic experimental setups. There are also many variants of the reaper problem that may 
be effective, and we believe that our formulations could be extended to more challenging problems. 
In this section, we survey some of the possibilities for future work. 



The reaper formulation ( 1.4 ) is based on a relaxation 



6.1. Variations on the reaper Problem 
of the ii orthogonal distance problem ( 1.3 ). First, we examine what happens if we try to strengthen 
or weaken the constraints in REAPER. Next, we discuss some alternatives for the ii objective 
function that we have applied to the residuals in both (1.3) and (1.4). Third, we show that it is 



possible to incorporate a centering step directly into the optimization. Last, we describe several 
methods for constructing a linear model from the solution to the REAPER problem. 



6.1.1. A Stronger Relaxation? The constraint set in (1.3) consists of all trace-d orthoprojectors 



where d is a positive integer. In the reaper problem, we relax this constraint to include all trace-d 
matrices whose eigenvalues fall in [0, 1]. One may wonder whether a tighter relaxation is possible. 
If we restrict our attention to convex sets, the answer is negative. 
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Fact 6.1. For each integer d G [0, D], the set {P G M^^^ : ^ P =^ I and tr P = d} is the convex 
hull of the D X D orthoprojectors with trace d. 

To prove this fact, it suffices to apply a diagonalization argument and to check that the set 
{A € : J2iLi \ = d and < Aj < 1} is the convex hull of the set of vectors that have d ones 
and D — d zeros. See |OW92j for a discussion of this result. 

6.1.2. A Weaker Relaxation? Another question is whether a weaker relaxation of reaper might 
be effective. As it turns out, only one of the semidefinite constraints is really necessary. Consider 
the modified optimization problem 

minimize ^ ||a; — Pa;|| subject to =^ P and ii P = d. (6-1) 
This problem may be substantially easier to solve in practice with general-purpose software such 



as CVX |GB081 iGBlOj . The modification (6.1) is equivalent to (1.4) in the following sense. 



Theorem 6.2 (Weak reaper). The problem (6.1) has an optimal point P^ that satisfies P^ ^ I 



We provide a constructive proof of Theorem 6.2 in Appendix O that converts any optimal point 



of (6.1) into an optimal point that satisfies the extra semidefinite constraint. 



6.1.3. Deadzone Formulation. In the next two sections, we consider some ways to adapt reaper 
that preserve convexity and yet may improve robustness. 

First, we consider a modified objective function that ignores small residuals entirely so that the 
large residuals are penalized even more severely. 

minimize ^ max{0, ||a; — Pa;|| — 5} subject to =^ P =^ I and tv P = d. (6.2) 

In other words, no point contributes to the objective unless the model explains it poorly. The 
positive parameter 6 controls the width of the deadzone. This approach is similar in spirit to a 



soft-margin classifier |HTF09j . One conceptual disadvantage of the deadzone formulation (6.2) is 
that it probably does not admit an exact recovery guarantee. 

6.1.4. Outlier Identification and Model Fitting. In practice, a two-step approach can lead to sub- 
stantially better linear models for data. First, we use a robust method to find a subspace that fits 
the data. Next, we identify and discard points that are poorly explained by the model. Finally, we 
fit a subspace to the reduced data, either using a robust method or a classical method. See |RL87| 
for an exposition of this idea. 

We can amplify this approach into an iterative reweighted ii method |CWB08] IKXAHIO] . Set 
the iteration counter /c 0, and set /3a; 1 for each x £ X. At each step, we solve a weighted 
version of the reaper problem: 

minimize f^x \\x — Px\\ subject to =<; P =<; I and trP = d. (6-3) 

Then we update the weights according to some rule. For example, 

(3x < rr II — r,^k■^ in ^^^^ x e X. 

maxjo, ||a; — P^'^-'£c|| I 



By canceling off the magnitude of the residual in (6.3), we hope to minimize the number of points 
that are poorly explained by the model. We repeat this procedure until we do not see any additional 
improvement. In a small experiment, we found that a single reweighting step already provided most 
of the benefit. 
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6.1.5. Incorporating Centering into reaper. In Section 1.4.1, we discussed the importance of cen- 
tering data before fitting a subspace. In this paragraph, we consider a direct method for finding an 
affine space that explains the data instead of using a separate centering step. There is a natural 
convex formulation for fitting an affine model to data. 

minimize ^ ||ic — Pa; — c|| subject to O^P^I and tv P = d. (6-4) 



The decision variables here are the center c and the symmetric matrix P. A variant of Algorithm 4.2 



can be used to solve this problem efficiently. We have obtained deterministic recovery guarantees 



for (6.4) when the inliers are contained inside an affine space. For brevity, we omit the details. 



The main shortcoming of (6.4) is that we do not see a way to spherize the centered data within 
the framework of convex optimization. We have found that s-REAPER is often more powerful than 
REAPER, so we believe it is better to center and spherize the data before fitting a model. 

6.1.6. Rounding the Solution. For real data, it is common that the solution P^ to the reaper 
problem has rank that exceeds d. In our numerical work, we select a d-dimensional model by 
forming the span of the d dominant eigenvectors of -P^. This approach is effective in practice, but 
it hardly seems canonical. It is likely that there are other good ways to achieve the same end. 

Let us mention three alternative ideas. First, we might take the entire range of as our model. 
It may have more dimensions than we specified, but it will surely provide a better fit than the 
lower-dimensional subspace. 

Second, we can adjust the parameter in the reaper problem until the solution has rank d. A 
quick way to accomplish this is to apply bisection on the interval [0,d]. This method also works 
well in practice. 

Third, we can randomly round the matrix P^ to a subspace with expected dimension d. To do 
so, let {Xi,Ui) be the eigenpairs of P^. Draw Bernoulli variables 5i with success probability Aj, 
and form spanjtij : 5i = 1}. This approach is inspired by work on randomized rounding of linear 
programming solutions to combinatorial problems. 

6.2. Extensions of the Analysis. The theoretical analysis in this paper is designed to provide an 
explanation for the numerical experiments that we have conducted. Nevertheless, there are many 
ways in which the simple results here could be improved. 

6.2.1. Refining the Data Models. The In & Out Model on page|10|is designed to capture the essential 
attributes of a dataset where the inliers are restricted to a subspace while the outliers can appear 
anywhere in the ambient space. The permeance statistic and the linear structure statistic capture 
the basic properties of the data that allow us to establish exact recovery. It would be interesting 
to identify geometric properties of the data that allow us to control these statistics. 

The Haystack Model on page |6] is a very special case of the In & Out Model, where the inliers are 
isotropic normal variables supported on a subspace and the outliers are isotropic normal variables 
on the ambient space. It would be more realistic to allow anisotropic data distributions and to 
consider extensions beyond the Gaussian case. 

Another worthwhile direction is to study the performance of REAPER and S-REAPER when the 
inliers are contaminated with noise. In this case, we cannot hope for exact recovery guarantees, 
but experiments indicate that the optimization methods are still effective at finding linear models 
that are comparable with the performance of oracle PCA on the inliers. 

6.2.2. Connections with Other Algorithms. Recht [Reel 2] has observed that the IRLS algorithm is 
very similar to an EM algorithm for the normal mixture model. There are also evident parallels with 
the multiplicative weight framework |Kal07j that is studied by the online algorithms community. 
It would be interesting to understand whether there are deeper connections among these methods. 
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6.3. Harder Problems. A very large number of applications require that we find a linear model 
for a noisy dataset. It is also common that we need to fit a more complicated model that is built 
out of simpler linear structures. The methods in this work are also relevant in these situations. 

6.3.1. Manifold Learning. A common component in methods for manifold learning is a local PCA 
step (see, e.g., |LV07j ). The idea is to look at small neighborhoods in the data and to fit a linear 
model to each neighborhood. The algorithms then stitch these pieces together to form a manifold. 
In this application, it is common for the data to contain outliers because the manifold usually has 
some curvature and the neighborhood may also contain points from different parts of the manifold. 
Robust methods for linear modeling could be very helpful in this setting. 

6.3.2. Hybrid Linear Modeling. There are many problems where the data lies on a union of sub- 
spaces, rather than a single subspace. The techniques in this paper are very relevant for this 
problem because, from the point of view of each subspace, the observations from the other sub- 
spaces are outliers. Suppose that X is a, set of observations, and we would like to fit m subspaces 
with dimensions di, . . . ,dm.- The following optimization offers one way to approach this problem. 

minimize ^ ||a; — P,;a;|| subject to ^ Pi =4 I, trPi = dj, and 



\Pl P2 ■ ■ ■ Pm] 



<l + a. (6.5) 



We need the third constraint to prevent the decision variables Pi from coalescing. The parameter 
a > reflects how much overlap among the subspaces we are willing to tolerate. 

It would be very interesting to study the empirical performance of this approach and to develop 
theory that explains when it can reliably fit hybrid linear models to data. The optimization 



problem (6.5) is inspired by the formulations in |ZSL09|. iLZllj . 



6.3.3. Generalized Linear Models. In this paper, we have focused on a robust method for fitting a 
standard linear model. Generalized linear models (GLMs) also play an important role in statistical 
practice. Unfortunately, the specific methods in this paper do not apply directly to GLMs. To 
make the extension, we need to replace the squared Euclidean loss with a Bregman divergence 
that is derived from the exponential family of the GLM |BMDG05] . In this setting, Bregman 
projectors play the role of orthogonal projectors. It would be interesting to develop a method for 
parameterizing the Bregman projector onto a subspace as a function of the subspace. Then, we 
might study how to relax this formulation to obtain a convex optimization problem. 

6.4. Conclusions. We have found that the reaper and s-reaper formulations provide a rigorous 
and effective method for fitting low-dimensional linear models to data. In future work, we hope to 
improve our theoretical understanding of these methods, and we plan to extend them to related 
problems. We hope that practitioners will try out these techniques to enhance data analysis in 
science and engineering applications. 
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Appendix A. Exact Recovery Conditions for the In & Out Model 



In this appendix, we establish Theorem 3.1, which provides recovery conditions for the deter- 
ministic In &; Out Model. The proof is an easy consequence of the following sharper result, which 
we prove in Section A.l below. 

Theorem A.l. Let L be a d- dimensional subspace of M.^ , and let X be a dataset that conforms 
to the In & Out Model on page \10[ Define a matrix Xout whose columns are the points in Xo^t, 
arranged in fixed order. Suppose that 

inf J2 K""' *)l 

\\u\\=l "'^^in 



> V2d ■ max III (n^± ^out) (n^^x^out 
Then TIl is the unique minimizer of the REAPER problem (11.4 



{n^TX^t){nLXoutf\\}. (a.i 



Proof of Theorem 3.1 from Theorem A.l. First, we check that the sufficient condition (3.5) for 
REAPER follows from (A.l). The permeance statistic (|3.1|) equals the left-hand side of (A.l): 



V{L):= inf K"' 

\\u\\=i xeX;^ 



We can bound the right-hand side of (|A.1|) using the linear structure statistics (|3.3|) and (|3.4|). 

n^x ^out 



(nixXout)(n^xX, 



out ) 



< 



IX, 



out I 



The inequality follows because the spectral norm is submultiplicative and ||n^x|| = 1. Similarly, 

(iifT%7t)(nLXout)' < cS(L^) • cS(M^). 



Introduce the latter three displays into (A.l) to conclude that ( |3.5[ ) is a sufficient condition for 
to be the unique minimizer of reaper. 

To obtain the sufficient condition for S-REAPER, we need to apply the sufficient condition for 
reaper to the spherized data. In this case, the scale-invariant permeance statistic (3.2) matches 
the left-hand side of (A.l): 

V{L)= inf K^' 

||ii||=i xeXi,-, 



On the right-hand side of (A.l), we replace each instance of Xout with the spherized matrix X, 
To complete this step, observe that two separate normalizations are redundant: 

n^xXout = n^^xXout- 



out • 



Therefore, 



Likewise, 



(n^xXout)(n^xXout)* 



< 



n^x Xout 



X, 



out 



S{L^)-S{^^). 



(n^xXout)(nLX, 



out j 



Introduce the latter three displays into (A.l) to conclude that ( |3.6[ ) is a sufficient condition for IIx, 
to be the unique minimizer of s-reaper. □ 
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A.l. Proof of Theorem A.l[ Throughout this section, we use the notation from the In & Out 



Model on page 10 In particular, recall that L is a d-dimensional subspace of M^, and let X 



Xm U Xc,,i be the set formed from a set X\n of inliers located in L and a set Afnnt of outliers located 



m 



\L. 



The proof of Theorem |A.1| is based on a perturbative argument. Observe that P = 11^ + A is 
feasible for the reaper problem (1.4) if and only if the perturbation A satisfies 



0=?;nL + A=<;I and trA = 0. 



(A.2) 



We demonstrate that, under the condition (A.l), the objective of reaper evaluated at + A is 



strictly larger than the objective evaluated at 11^. 



To establish Theorem A.l, we proceed through a series of (very) technical results. The first 



lemma identifies a variational inequality that is sufficient for the overall result to hold. The proof 
appears below in Section |A.2 



Lemma A.2 (Sufficient Conditions for Optimality). Decompose the matrix A into blocks: 

A = Ul/^Ul + n^x ahl + Hi An^x + n^x An^x (a.3) 



Suppose that, for all A. ^ that satisfy ( |A.2[ ), we have the conditions 

IIAi^ll > V2 (as, (nI7^)(nix 



and 



(A.4) 
(A.5) 



Then JJ^ is the unique minimizer of the REAPER problem (1.4) 



We must check that the two conditions in Lemma A.2 hold under the hypothesis (A.l) of Theo- 
rem 



A.l The proof of this result appears below in Section A.5 

iditions). The ineq 



Lemma A.3 (Checking the Conditions). The inequality (A.4) holds when 
1 



inf 



n^xx 



out , 



n^xx 



out 



The inequality (A.5) holds when 



1 



inf K"' x)\>V2 



I n_LXout 



(A.6) 



(A.7) 



We may now complete the proof of Theorem A.l 



Proof of Theorem A.l from Lemmas A.S\ and A.3, The hypothesis ( A.l ) implies the two conditions 
in Lemma A.3 Therefore, the two conditions in Lemma A.2 are in force, which means that Hl is 
the unique minimizer of REAPER. □ 



A.2. Proof of Lemma A.2, The reaper problem (1.4) is convex, so it suffices to check that the 



objective increases for every sufficiently small nonzero perturbation A that satisfies the feasibility 



condition (|A.2|). The difference between the perturbed and the unperturbed objective values can 

r?(A) := Y l\mL^-^M-\\UL. 



be expressed as 



(A. 



because Ili± 



TIl- Our goal is to show that r/(A) > whenever A is small and feasible. 
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We split the expression for r/(A) into a sum over inliers and a sum over outliers. Each inlier 
X G A^in falls in the subspace L, so that Tli±x = 0. Therefore, 



lllr I a;| 



I Atcl 



for X £ 



(A.9) 



To evaluate the summands in (|A.8|) where x G ^out, recall that each outlier has a nontrivial 

|2ll/2 



projection on the subspace L . When A — )• 0, we compute that 

\\{Ul± - A)x\\ = [\\nL±xf - 2 {Ax, Ul±x) + II Aa; 

2 



\Ul±x 



1 



|nr±a;| 



Ax, nL±x) + o{ II A 



1/2 



= lln^xicll - i^Ax, nL±xj + 0[ II A|p ) for a; e Xont- 
The last identity follows from the Taylor development of the square root. In summary, as A — t- 0, 



(n 



A)x\ 



\Ul±x\ 



Ax, Tli±x 



O \\A\ 



for X £ X, 



out' 



(A.IO) 



Introduce the inequality (A.9) for the inlier terms and the inequality (A.IO) for the outlier terms 
into the definition ( |A.8[ ) to obtain 



O \\A\ 



when A — >• 0. 



Therefore, ??(A) > for all nonzero feasible perturbations provided that 

||Aa;|| > ^ (^Ax, n^xa;^ for each A / that satisfies ( |A.2[ ). 



(A.ll) 



In fact, the same argument demonstrates that (A.ll) becomes a necessary condition for 11^ to 
minimize (1.4) as soon as we replace the strong inequality by a weak inequality. 



To continue, we use the block decomposition (A. 3) to break the sufficient condition (A.ll) into 
two pieces that we can check separately. First, note that II^xA = A2TIL + AsII^x, so the 
right-hand side of (A.ll) splits up as 

J2 (Acc, U^x) = {^2Ulx, U^x) + (^sn^x^c, U^x] 

XGXout Xl=Xout XGXout 

= Y {^2, (ilir^mLxY) + Y (A3, (nlra;)(n^xa;)*). (A.12) 

The second identity holds because {Ab, c) = {A, cb^). Next, recall that each inlier falls in L, and 
use orthogonality to check that 



|Aa;| 



(nL + nrx)Ania;| 



lAia^lP + ||A2x||^ for a; G Afin. 



Therefore, we can bound the left-hand side of (A.llj) below. 

121I/2 

-77^ 

xeXin 



Y l|Aa;|| = Y [l|Aia;||^ + ||A2a;| 

x£Xin XGXin 



-^T. [l|Aia;|| + ||A2a;| 



(A.13) 



Together, the assumptions (A.4) and (A. 5) imply that the right-hand side of (A.13) exceeds the 
right-hand side of (A.12). Therefore, the perturbation bound (A.ll) holds. 
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A. 3. Properties of a Feasible Perturbation. To simplify the conditions in Lemma A. 2 
need to collect some information about the properties of a feasible perturbation. 



we 



Sublemma A. 4. Suppose that A satisfies the condition (A.2) for a feasible perturbation. Then 
the following properties hold. 

(i) tr Ai + tr A3 = 0, 

(ii) Ai ^ 0, and 

(iii) A3 ^ 0. 

In addition, assume that A2 7^ 0. Then 

(iv) Ai / 0, 

(v) A3 / 0, and 

(vi) tr A3 = - tr Ai > 0. 

In particular, Ai = or A3 = implies that A2 = 0. 



Proof. Assume that A satisfies the condition (A.2) for a feasible perturbation. To prove Claim [(i 
apply the cyclicity of the trace to see that 

tr A2 = tr(ni_LAnL) = tr((nLnix)A) = tr(0) = 0. 



Take the trace of the block decomposition (A. 3) to verify that tr Ai +tr A3 = tr A = 0. 
To check (ii) , conjugate the relation =<; 11^ + A by the orthoprojector n^± to see that 

o^nix(ni + A)nix = A3. 

Similarly, Claim 



We verify (iv) 



(iii) follows when we conjugate 11^ + A ^ I by the orthoprojector 11^. 
and (v) by establishing the contrapositive. If Ai 



or A3 = 0, then A2 = 0. 
First, we argue that Ai and A3 are zero together. Indeed, assume that Ai = 0. Claim (i) implies 
that tr(A3) = 0. But then |(iii) forces A3 = 0. Likewise, A3 = implies that Ai = 0. 

Now, suppose Ai = A3 = 0. According to the Schur complement lemma |And051 Thm. 5.3], 

O^Hl + A ^ Ai ;^ and A3 ^ A2(nL + A^^^ A*2, 

where the dagger ^ denotes the Moore-Penrose pseudoinverse. Since A is feasible, the left-hand 
side of the implication is true, and so 

A3 > A2n{A\ = A2nLA*2 = A2A*. 

Therefore, A2 = because A3 = 0. 

Finally, Claim [(yi) holds because (iii) and |(v)| together imply that tr(A3) > 0. Invoke [(I)] to 
complete the proof. □ 

A. 4. Lower Bounds for the Sum over Inliers. The next result allows us to obtain lower bounds 
for the left-hand sides of the conditions in Lemma [A. 21 The statement involves the Schatten 1-norm 
||-||^^, which returns the sum of the singular values of a matrix |Bha97t p. 92]. 

Sublemma A. 5. The following relations hold for every matrix A. 

inf 



in^xAn^l 



V iinrxAnLicii > inf V iiniAnLxi 

^ tr(niAni)=i ±^ 

^ inf Yl K""' 



> 



^/d "SL 



(A.14) 
(A.15) 



Proof. To verify the relation (A.14), we show that each feasible point A for the left-hand infimum 
can be converted into a feasible point A for the right-hand infimum without changing the objective 
value. Let A be an arbitrary matrix, and fix a singular value decomposition n^xAXI^ = 
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Define the positive semidefinite matrix A = VSV*, and observe tliat A 
SV^IIi = SV*. First, the unitary invariance of the trace shows that 



Ul^^l because 



trS = tr A = tr(nLAnL). 

Next, let X be an arbitrary vector, and calculate that 

lin^iAnL^rll = ||s;F*iE|| = ha^eH = ||nLAnLa;| 



Therefore, inequality (A. 14) is valid. 

The second relation (A. 15) is more difficult. The basic idea is to replace the trace constraint 
in the left-hand infimum with a Frobenius-norm constraint. Subject to this new constraint, the 
infimum has a simpler form. First, factor out the Frobenius norm ||niAni||p from the left-hand 
side of dAlsl): 



inf 

tr(ni,Ani,)=i 



E 



\UlAUlx\ 



> inf 

tr{n£Ani)=i 



inrAn 



lIIf 



inf 

l|niAn£||p=i 



J2 \\UlAUlx\\. (A.16) 



To compute the first infimum on the right-hand side of (|A.16|), note that the rank of YIlAUl 

m 

] 



cannot exceed d because L is d-dimensional subspace. Therefore, by diagonalization, 

Vd' 



inf 

tr{niAni)=i 
nLAni,:^o 



iniAnillF = inf |||A 



1, A G 



(A.17) 



To compute the second infimum, we use a more careful diagonalization argument. We can pa- 
rameterize the feasible set using the spectral decomposition 11/, An^ = J2 
eigenvalues Aj are nonnegative and the eigenvectors Ui form an orthobasis for L. Thus, 



^ XiUiul, where the 



inf 

linLAniii, 



J2 linLAn^a:! 



inf inf V" ■- "^^ I ^""^ 

orthobascs {uj} for L ||A|| = 1 ^g^, 



X 



This expression can be simplified dramatically. Make the identification pi = Xf for each i = 1, . . . ,d. 
Observe that the objective on the right-hand side is a concave function of the vector p. Furthermore, 
p is constrained to the convex set of nonnegative vectors that sum to one (i.e., probability densities). 
The minimum of a concave function on a compact set is achieved at an extreme point, so the infimum 
occurs when p is a standard basis vector. Therefore, 

1/2 



inf 

l|A|| = 



^ 5: [£i^x^\{u,,x)\^\ 



mm 

i=l,...,d 



x)\. 



Combining the last two relations, we obtain 

J2 \\UlAUlx\ 



inf 

||ni,AnL||p=i 



inf K""' 

||ll|| = l xeXin 



To complete the proof of (A. 15), we introduce (A.17) and (A. 18) into (A.16). 



(A.18) 



□ 



A. 5. Proof of Lemma A. 3 , We are now prepared to check that the conditions in Lemma A. 3 



imply the conditions in Lemma A. 2 Throughout the argument, we assume that A is a nonzero 
perturbation that satisfies the feasibility condition (A.2), so we have access to the results in Sub- 
lemma IA.4I 



we 



First, observe that the inequality ( A.4[ ) is homogeneous in A. In view of Sublemma A. 4 
can scale A so that tr(A3) = — tr(Ai) = 1. Furthermore, we may assume that Ai is negative 
semidefinite and A3 is positive semidefinite. Therefore, it suffices to check the constrained infimum 



32 



G. LERMAN, M. MCCOY, J. A. TROPP, AND T. ZHANG 



of the left-hand side of (A.4) exceeds the constrained supremum of the right-hand side. It fohows 



immediately from Sublemma A.5| that 



1 



Meanwhile, the duality between the the Schatten 1-norm and the spectral norm gives 



^/2 



sup 

tr A3=l 



^ (A3, {Ui^^x){U^^xf) = V2 

= V2 



(nixXout)(nixX, 



out J 



The first equality holds because the matrix inside the spectral norm happens to be positive semi- 
definite. For the second identity, we simply identify the sum as the product of the two matrices. 



Combine the last two displays to see that the condition (A.4) follows from (A.6). 



A.5 



The second part of the proof is similar. Observe that the inequality (|A.5|) is homogeneous in 
A2, so we may assume that ||A2||5'^ = 1. Sublemma 

1 



delivers the bound 



inf ^ > 



^ '^l E 1^"' "^1- 



The duality argument we used for the first condition now delivers 



^/2 



sup 

l|A2|ls^=l 



^ (A2, {iiL^x){ULxy) = V2 (n^xXout)(nLXout)* 



Combine these two displays to see that (A.5) is a consequence of (A. 7). 



Appendix B. Exact Recovery Conditions for the Haystack Model 



In this appendix, we establish exact recovery conditions for the Haystack Model. To accomplish 
this goal, we study the probabilistic behavior of the (spherical) permeance statistic and the (spher- 
ical) linear structure statistic. Our main result for the Haystack Model, Theorem B.l , follows when 
we introduce the probability bounds into the deterministic recovery result. Theorem |3.1[ Theo- 
rem [LT| the simplified result for the Haystack Model, is a consequence of the more detailed theory 
we develop here. 

Theorem B.l. Fix a parameter c > 0. Let L be an arbitrary d- dimensional subspace ofM.^, and 
draw the dataset X at random according to the Haystack Model on page 

(i) Let 1 < d < D — 1. Suppose that 

I / r-r\ 2 

fout 



(2 + c)^> 



2D 
D-d- 0.5 




B.la 



Then Hi is the unique minimum 0/ REAPER (1.4), except with probability 3.5e ^^'^Z^. 
(ii) Let 2 < d < D — 1. Suppose that 

2 



Pir 



(2 + cV2)^ > 



D 



2D 



D-0.5 y D-d- 0.5 




(B.lb) 



Then TIl is the unique minimum o/s-REAPER (1.6), except with probability 4e '^^'^Z^. 
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(iii) Let d = 1 and D > 2. Suppose that 

~^D~ ( 

VPout 
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> 



D 



D-0.5 y D-1.5 



+ 1 + C,/- 



(B.lc) 



Then Hl is the unique minimum o/s-REAPER (1.6), except with probability 3e 



-c2/2 



We prove these rather daunting expressions below in Sections B.l and B.2 First, let us examine 
the content of Theorem B.l in a specific regime. Suppose that 

— 7- oo, d^oo, and d/D^O. 



Then the condition (B.lb) simplifies to 

An > \/7r • (VPout + 1)^- 

This inequality matches our heuristic (3.7) exactly! Now, we demonstrate that Theorem B.l 



con- 



tains the simplified result for the Haystack model. Theorem 1.1 



Proof of Theorem \l.l\ from Theorem B.l. To begin, we collect some basic inequalities. For a > 0, 
the function f[x) = x — a^fx is convex when x > 0, so that 

f{x)>f{o?) + f\a'){x-c?) = \{x 



X — ayx 



a 



For \ < d < [D — l)/2, we have the numerical bounds 

D ^ D 



< 2, 



d 



D-d- 0.5 

Finally, recall that (a + 6 + c)^ < 3(a^ + 6^ + c^) as a consequence of Holder's inequality. 

To prove the claim (1.7) of Theorem |1.1[ we apply our numerical inequalities and the fact 
(2 + c)^ < 2(4 + c^) to weaken ( |B.la[ ) to the condition 

^ (pout + l + 2c2) 



Pir 



vr(4 + c^) > 12 



Apply Theorem B.l 'i) with c = \/2^ to obtain ( |1.7[ ) with constants Ci = 47r < 13 and C2 = 27r < 7 

and C3 = 12y7r/2 < 16. 

To prove the claim (1.8) of Theorem |1.1| we use the same reasoning to weaken (B.lb) to the 
condition 

Pin - vr(4 + 2c2) > I2J ^(pout + 1 + 2c2) 



Apply Theorem B.l u) with c = \/2^ to obtain (1.8) in the range 2 < d < {D — l)/2. The case 
d = I follows from a similar argument using Theorem B.lpiT 



□ 



B.l. Probability Inequalities for the Summary Statistics. The proof of Theorem B.l re- 
quires probability bounds on the permeance statistic V, the linear structure statistic S, and their 
spherical counterparts V and S. These bounds follow from tail inequalities for Gaussian and spher- 
ically distributed random vectors that we develop in the next two subsections. 

B.1.1. The Permeance Statistic. Our first result is used to provide a high-probability lower bound 
for the permeance statistic 'P{L) under the Haystack Model. 



Lemma B.2. Suppose gi, . . . ,gn are i.i.d. normal(0,I) vectors in 

" [2 . 

,^^}^ "^liu, gi)\> \- ■n-2Vnd 
j|ti,||=i V vr 



i=l 



For each t > 0, 
ty/n, (B.2) 



except with probability e 



34 



G. LERMAN, M. MCCOY, J. A. TROPP, AND T. ZHANG 



Proof. Add and subtract the mean from each summand on the left-hand side of (B.2) to obtain 

n n n 

inf ^|(«, 5,)|> inf »,)|-EKti, inf Y.^\{u, g,)\ (B.3) 

||tt|| = l — ; l|ti.||=l — ; ||it|| = l — ; 



i=l 



1=1 



i=l 



The second sum on the right-hand side has a closed form expression because each term is the 
expectation of a half-Gaussian random variable: E | {u, Qi) \ = y^2/7r for every unit vector u. 
Therefore, 



inf J2^\{u, g,)\ 



u =1 



i=l 



n. 



(B.4) 



To control the first sum on the right-hand side of (B.3), we use a standard argument. To bound 
the mean, we symmetrize the sum and invoke a comparison theorem. To control the probability of 
a large deviation, we apply a measure concentration argument. 

To proceed with the calculation of the mean, we use the Rademacher symmetrization 
lemma |LT91l Lem. 6.3] to obtain 



E sup [(lEK^, 9i 
lhll=i 1=1 



{u, gi)\] <2E sup ^ej|(it, gi)\ . 



\\u\=l 



1=1 



The random variables ei, . . . , e„ are i.i.d. Rademacher random variables that are independent from 
the Gaussian sequence. Next, invoke the Rademacher comparison theorem |LT9H Eqn. (4.20)] with 
the function ip{-) = |-| to obtain the further bound 

n n 

E sup ^ [(E|(w, 9^)1) - < 2E sup ^ e^, («, ^i) = 2E 11^"^^ . 

\H\=i i=i \\u\\=i i=i * 

The identity follows when we draw the sum into the inner product a maximize over all unit vectors. 
From here, the rest of the argument is very easy. Use Jensen's inequality to bound the expectation 
by the root-mean-square, which has a closed form: 



E sup [(lEK'", 9^ 
ll"ll=l i=l 



{u, g^)\ \ < 2 



En 
^=l^'9^ 



1/2 



2Vnd. 



(B.5) 



Note that the mean fluctuation (B.5) is dominated by the centering term (B.4) when d. 

To control the probability that the fluctuation term is large, we use a standard concentration 
inequality |Bog98[ Theorem 1.7.6] for a Lipschitz function of independent Gaussian variables. Define 

a real- valued function on dxn matrices: /(Z) = sup||^||=x Y^i=i{\f2h^—\{u, where denotes 

the ith. column of Z. Compute that 



\f{Z)-f{Z')\< sup z,-z^)| <^||zi-^^|| < V^||Z-Z'||p, 

II^INi i=i 



i=l 



Therefore, / has Lipschitz constant ^/n with respect to the Frobenius norm. In view of the esti- 



mate (B.5) for the mean, the Gaussian concentration bound implies that 

IP| sup V[(E|(n, gi)\)-\{u, g,)\] >2y/^d + tM < e'^'/^ 
UIHI=i i=i J 



Introduce the bound (B.6) and the identity (B.4) into (B.3) to complete the proof. 



(B.6) 



□ 



To control the spherical permeance statistic, we need a variant of Lemma |B.2 The following 
observations play a critical role in this argument and elsewhere. Suppose that g is uniformly 
distributed on the unit sphere in M'^, and let r be a x-distributed random variable with d degrees 
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of freedom that is independent from g. Then rg ^ g, where g is a standard normal vector. To 
apply this fact, we need some bounds for the mean of a x-distributed variable: 



Vd^^Ol<Er<Vd when 



r ~ Xd- 



(B.7) 



The upper bound follows from Jensen's inequality, while the lower bound is a refinement of 
Gautschi's inequality due to Kershaw |Ker83j . 

Lemma B.3. Suppose gi, ■ ■ ■ ,gn are i.i.d. random vectors uniformly distributed on the unit sphere 
S'^-i in R'^. When d = l, 



„inf^ UK""' 9i)\ 



n. 



Assume that d> 2. For all t > 0, 



51 K"' >\ y=-2V^-t 

\u\\=l ^ V TT A/d 



i=l 



n 



d-1 



(Bi 



(B.9) 



-t2/2 



except with probability e 

Proof. When d = 1, the argument is particularly simple. A random variable uniformly distributed 
on the 0-dimensional sphere has a Rademacher distribution. The result (B.8) is immediate. 



Assume that d >2. The proof is analogous to the argument for the Gaussian case, Lemma B.2 



Add and subtract the mean from the left-hand side of (B.9) to reach 



inf > inf X^[|(tx, ^,)|-E|(w, ^01] + inf X^E|(«, firi)|. (B.IO) 

It =1 — T It =1 — T u =1 — ; 

II II II II II II 

To bound the second term on right-hand side, we introduce independent x-variates to facilitate 
comparison with a Gaussian. Indeed, 

EKix, gi)\ = ^-Er^E~\{u, r,gi)\>^-E\{u, g.^)] > 




The last inequality follows from (B.7). Therefore, 



inf Y.^\{u, g,)\>J-. 
1^11=1 ~{ V vr 



2 n 



(B.ll) 



Next, we bound the mean of the first term on the right-hand of (B.IO). By repeating the 



symmetrization and comparison argument from Lemma B.2, we reach 



E sup gi)\)-\{u, <2 

ll"ll=i 1=1 



En ^ 



1/2 



2vn. 



The identity depends on the fact that each vector g^ has unit norm. 

To control the probability of a large deviation, we need an appropriate concentration inequality 
for the function / defined on page 34 We apply a result for a Lipschitz function on a product 
of spheres |LedOH Thm. 2.4], with the parameter c = d — 1, along with the definition [LedOU 
Prop. 1.11] of the concentration function to reach 



sup g,)\)-\{u, gi)\] >2Vn + t 

Mu\\=i i=i 



n 



d-l 



< e 



Introduce (B.12) and (B.ll) into (B.IO) to complete the proof of (B.9). 



(B.12) 

□ 
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B.1.2. Probability Bounds for the Linear Structure Statistics. To control the linear structure statis- 
tic S, we need a tail bound for the maximum singular value of a Gaussian matrix. The following 
inequality is a well-known consequence of Slepian's lemma. See [DSOll Thm. 2.13] and the er- 
rata |DS03| for details. 

Proposition B.4. Let G be an m x n matrix whose entries are i.i.d. standard normal random 
variables. For each t > 0, 

P{||G|| >V^+v^ + t}<l - < e-*'/2^ 
where is the Gaussian cumulative density function 

1 



To bound the spherical linear structure statistic S, we need a related result for random matrices 
with independent columns that are uniformly distributed on the sphere. The argument bootstraps 
from Proposition |B.4 



Lemma B.5. Let S be an m x n matrix whose columns are i.i.d. random vectors, uniformly 
distributed on the sphere S"*"-*^ in IR 



\S\\ > 



For each t > 0. 
n + -v/m + t 



< 1.5e 



-t2/2 



\/m — 0.5 

Proof. Fix > 0. The Laplace transform method shows that 



(B.13) 



P :-- 



> Vn + Vrn + t \ ^ _ jg ^e^^^^^os |1S|| ^ 

~ \Jm — 0.5 ^ 



We compare ||S|| with the norm of a Gaussian matrix by introducing a diagonal matrix of x- 
distributed variables. The rest of the argument is purely technical. 

Let r = (ri, . . . , r„) be a vector of i.i.d. x-distributed random variables with m degrees of freedom. 
Recall that riQi ~ gij, where gi is uniform on the sphere and gi is standard normal. Using the 
bound ( B.7| ), which states that \Jm — 0.5 < Er,, and Jensen's inequality, we obtain 

where G is an m x n matrix with i.i.d. standard normal entries. 

Define a random variable Z := \\G\\ — ^Jn — ^Jm^ and let := max{Z, 0} denote its positive 
part. Then 



e^* • P < E e^^ < E e^^+ 



1 + 



> r} dr. 



Apply the cdf bound in Proposition B.4 and identify the complementary error function erfc. 



• P < 1 + 



e^^ • erfc 



^\ dr. 



^/27 



A computer algebra system will report that this frightening integral has a closed form: 



erfc 



^\ dr 



^2) 



e^'/2 (erf(0) + l)-l<2e' 



02/2 



1. 



We have used the simple bound erf(0) < 1 for > 0. In summary, 

"1 



P < e 



+ e' 



02/2 



Select Q = t\jO obtain the advertised bound (B.13). 



□ 
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B.2. Proof of Theorem \BA\ Let X 

out be a D X A'out matrix whose columns are the outhers 
X G A^out, arranged in fixed order. Recall that the inlier sampling ratio pin := N[^/d and the outlier 
sampling ratio pout := Nout/D. 

First, we establish Claim (i) , The N^^ inliers are drawn from a centered Gaussian distrib ution 
on the d-dimensional space L with covariance (cTf^/d)!^. Rotational invariance and Lemma B.2 
with t = cVd, together imply that the permeance statistic (3.1) satisfies 



V{L)> 



0\n 

Vd 



(2 + c) 



except with probability e ^^'^Z^. The A'out ou tlier s are independent, centered Gaussian vectors 
^ with covariance {a'^^^^./ D)I. Proposition 



m 



structure statistic (3.3): 



B.4 



with t = cVd, delivers a bound for the linear 



< 



^ [VKut + VD + cVd\= aout 



V/Oout + 1 + CI 



except with probability e~'^^'^Z^. Rotational invariance implies that the columns of II^x Xout are 
independent vectors that are uniformly distributed on the unit sphere of a (Z) — (i)-dimensional 
space. Lemma B.5| yields 



< 



out + VD-d + c\/d 



< 



D 



D-d- 0.5 



^/P^+ 1 + CI 



except with probability 1.5 e""^^"^/^. Under the assumption (B.la), we conclude that 

V{L) > V2d-S{L^)-S{R^) 

except with probability 3.5 e~ 



-c'^d/2 



. Apply Theorem 3.1 to complete the argument. 



To establish Claim (ii) , we also need to bound the spherical permeance statistic (3.2) and the 



spherical linear structure statistic (3.4). Assume that d>2. The A^in spherized inliers are uniformly 
distributed over the unit-sphere in the d-dimensional space L. Lemma B.3, with t = c^/d, implies 
that the spherical permeance statistic satisfies 



V{L)> 



2 _ iViu 
vr ^/d 



d-1 



> Vd 



(2 + cV2)Va 



except with probability e '^^'^Z^. The second inequality follows from the numerical bound d/((i—l) < 
2 for d> 2. For the outliers, note that the A'out columns of Xpu t are independent random vectors, 
uniformly distributed on the unit sphere in M^, so Lemma B. 5 shows that 



IXo 



< 



VD - 0.5 



D 



D - 0.5 



except with probability 1.5 e ^^"'Z^. Under the assumption (B.lb), we conclude that 



ViL) > V2d-S{L^)-S{R^), 



except with prob abilit y 4e~^ Theorem 3.1 establishes the result. 

Final ly, C laim (iii) follows from a refined bound on 'P{L) that holds when d = 1. In this case. 
Lemma B.3 yields 'P(L) = A'^in- The result follows in the same manner as part (ii) 
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Appendix C. Analysis of the IRLS Algorithm 



This appendix contains the details of our analysis of the IRLS method, Algorithm 4.2 First, 
we verify that Algorithm 4.1 reliably solves the weighted least-squares subproblem (4.1). Then, we 
argue that IRLS converges to a point near the true optimum of the reaper problem (1.4). 

C.l. Solving the Weighted Least-Squares Problem. In this section, we verify that Algo- 
rithm 4.1 correctly solves the weighted least-squares problem (4.1). The following lemma provides 
a more mathematical statement of the algorithm, along with the proof of correctness. 

Lemma C.l (Solving the Weighted Least-Squares Problem). Assume that < d < D, and suppose 
that X i a set of observations in M^. For each x € X, let Px be a nonnegative weight. Form the 
weighted sample covariance matrix C , and compute its eigenvalue decomposition: 



C 



UAU^ where Ai > • • • > Ad > 0. 



xex 

When rank(C) < d, construct a vector v G via the formula 

i/:=(l, 1, d- [d\, 0, 0)^ 
' 

[d] times 

When rank(C) > d, define the positive quantity 6 implicitly by solving the equation 

^ I A.- — r/ 1 I 

d. 



(C.l) 



i=l 



Construct a vector v G M'^ whose components are 

(Ai - Q) 



A,: 



for « = 1, 



D. 



In either case, an optimal solution to (4.1) is given by 

:=U diag(i/) • U\ 

In this statement, (a)+ := max{0, a}, we enforce the convention 0/0 : = 
diagonal matrix from a vector. 



(C.2) 

(C.3) 
(C.4) 

0, and diag forms a 



Proof of Lemma C.l. First, observe that the construction (C.4) yields a matrix that satisfies 
the constraints of (4.1) in both cases. 

When rank(C7) < d, we can verify that our construction of the vector v yields a optimizer of (4.1 ) 
by showing that the objective value is zero, which is minimal. Evaluate the objective function (4.1 ) 
at the point P^ to see that 



El 



(I - P,) x\\^ = tr [(I - P,)C(I - P.)] 



D 

E 

i=l 



(1 - Xi 



(C.5) 



by definition of C and the fact that C and P^, are simultaneously diagonalizable. The nonzero 
eigenvalues of C appear among Ai, . . . , Ai^^^j . At the same time, 1 — i/j = for each i = 1, . . . , [d\. 
Therefore, the value of (C.5) equals zero at P^. 

Next, assume that rank(C) > d. The objective function in (4.1) is convex, so we can verify that 
Pt solves the optimization problem if the directional derivative of the objective at P^ is nonnegative 
in every feasible direction. A matrix A is a feasible perturbation if and only if 

=^ P^ + A =<; I and tr A = 0. 



Let A be an arbitrary matrix that satisfies these constraints. By expanding the objective of (4.1) 
about P-^, easily compute the derivative in the direction A. Therefore, the condition 

-(A, (I-P,)C) >0 (C.6) 
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ensures that the derivative increases in the direction A. 

First, note that the quantity 9 can be defined. Indeed, the left-hand side of (C.2) equals rank(C) 
when 9 = 0, and it equals zero when 9 > Xi. By continuity, there exists a value of 9 that solves 
the equation. Let be the largest index where Aj^ > 9, so that Vi = for each i > z^. Next, 
define M to be the subspace spanned by the eigenvectors lij^+i, . . . ,ud. Since i^i is the eigenvalue 
of Pi, with eigenvector Uj, we must have TlMP-k^Ai = 0. It follows that TIm^^m ^ because 

nM(-P* + A)nM ^0. 

To complete the argument, observe that 

(1 - Vi)\i = Xi - {Xi - 6*)+ = min{Aj,6'}. 
Therefore, (I — Pi,)C = U ■ diag(min{Ai, 9}) ■ Using the fact that tr A = 0, we obtain 
(A, (I - P,)C) = (A, U diag(min{A„ 9} - 9) ■ U') 

= (A, U • diag(0, . . . , 0, A,,+i - g, . . . , Ap - g) • U') 

=.z 

Since Xi < 9 for each i > i^,, each eigenvalue of Z is nonpositive. Furthermore, TImZTIm = Z. We 
see that 

(A, (I - P.)C) = (A, UmZUm) = (HmAHm, Z) < 0, 
because the compression of A on M is positive semidefinite and Z is negative semidefinite. In 



other words, (C.6) is satisfied for every feasible perturbation A about P^. □ 



C.2. Convergence of IRLS. In this section, we argue that the IRLS method of Algorithm 4.2 



converges to a point whose value is nearly optimal for the reaper problem ( 1.4 ). The proof consists 
of two phases. First, we explain how to modify the argument from |CM99j to show that the iterates 
pW converge to a matrix Ps, which is characterized as the solution to a regularized counterpart 
of REAPER. The fact that the limit point Ps achieves a near-optimal value for reaper follows from 
the characterization. 



Proof sketch for Theorem 4-1 We find it more convenient to work with the variables Q := I — P 
and Q^^^ := I — P^''\ First, let us define a regularized objective. For a parameter 6 > 0, consider 
the Huber-like function 



2 



2 \ y 



We introduce the convex function 

F{Q) := J2 HsiWQ^h WQ^W) 

~ ^{a;:||Qa;||>5} ^ 2 ^{a;:||Qa;||<5} ( S ^ ^ 

The second identity above highlights the interpretation of -F as a regularized objective function 



for (1.4) under the assignment Q = 1 — P. Note that F is continuously differentiable at each 



^ max{||Qa;|| , 5]' 



matrix Q, and the gradient 

VF(Q) = 

The technical assumption that the observations do not lie in the union of two strict subspaces of 
MP implies that F is strictly convex; compare with the proof [ZLIU Thm. 1]. We define Q5 to be 
the solution of a constrained optimization problem: 

Qs := arg min F{Q). 

tr Q=D-d 
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The strict convexity of F implies that Qs is weh defined. 



The key idea in the proof is to show that the iterates Q^^'^ of Algorithm 4.2 converge to the 



optimizer Qs of the regularized objective function F. We demonstrate that Algorithm 4.2 is a gen- 



eralized Weiszfeld method in the sense of [CM99t Sec. 4]. After defining some additional auxiliary 
functions and facts about these functions, we explain how the argument of [CM99| Lem. 5.1] can 
be adapted to prove that the iterates of Q'-'^^ = I — P^^^ — t- Qs- The only innovation required is an 
inequality from convex analysis that lets us handle the constraints =4 Q =4 ^ and tr Q = D — d. 
Now for the definitions. We introduce the potential function 

G{Q, Q('-)) := HsiWQxW, WQ^'^xW). 

Then QW) is a smooth quadratic function. By collecting terms, we may relate G and F 
through the expansion 

G{Q, Q^^^) = F{Q^^^) + {Q- Q^^\ VF(qW)) + \{Q- Q^^\ C{Q^^^){Q - Q^)) , 

where C is the continuous function 

C(Q('=)) := 



max{||Qa;|| , 5} ' 



Next, we verify some facts related to Hypothesis 4.2 and 4.3 of |CM99| Sec. 4]. Note that 
F{Q) = G{Q, Q). Furthermore, F{Q) < G{Q, Q^'')) because Hs{x,x) < Hs{x,y), which is a 
direct consequence of the AM-GM inequa lity. 



We now relate the iterates of Algorithm 4.2 to the definitions above. Given that Q^'^^ = 1 — P^^\ 



Step 2b of Algorithm 4.2 is equivalent to the iteration 

qC^+i) = argmin G(Q, Q^''^). 

tvQ=D-d 

From this characterization, we have the monotonicity property 

f(q('=+i)) < g(q(^+i), q^'^)) < G{Q^''\ g^'^)) = F(Q(*^)). (C.7) 

This fact motivates the stopping criterion for Algorithm |4.2| because it implies the objective values 
are decreasing: a('^+^) = F{Q^^^) < F{Q^^^ = a^^^. 

We also require some information regarding the bilinear form induced by G. Introduce the 
quantity m := max {(5, maXajeArlHa^H}}- Then, by symmetry of the matrix Q, and the fact that 
the inner product between positive semidefinite matrices is nonnegative we have 



Q, C{Q 



The technical assumption that the observations do not lie in two strict subspaces of MP implies in 
particular that the observations span MP . We deduce that /i > 0. 

Now we discuss the challenge imposed by the constraint set. When the minimizer Q(^+i) lies on 
the boundary of the constraint set, the equality |CM99| Eqn. (4.3)] may not hold. However, if we 
denote the gradient of G with respect to its first argument by Gq, the inequality 

= {Q- Q^^+^\ VF{Q^^^) + C7(g('=))(Q('=+i) - Q^'^))) (C.8) 




holds for every Q in the feasible set. This is simply the first-order necessary and sufficient condition 
for the constrained minimum of a smooth convex function over a convex set. 
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With the facts above, a proof that the iterates Q^^'^ converge to Qs fohows the argument 



of |CM99t Lem. 5.1] nearly hne-by-hne. However, due to inequaUty (C.8), the final conclusion 
is that, at the limit point Q, the inequality {Q — Q, VF{Q)^ > holds for all feasible Q. This 
inequality characterizes the global minimum of a convex function over a convex set, so the limit 
point must indeed be a global minimizer. That is, Q = Qs. In particular, this argument shows 
that the iterates P^'^) converge to Ps := 1 — Qs as A; — )• oo. 



The only remaining claim is that = l—Qs nearly minimizes (1.4). We abbreviate the objective 



of (1.4) under the identification Q = I — P by 

FoiQ) ■= Y..,, \\Q-\\ ■ 

Define := arg mini<o(Q) with respect to the feasible set =^ Q =^ I and tr(Q) = D — d. From 
the easy inequalities x < Hs{x,x) < x + \5 for x > 0, we see that 

< F(Q) - Fo((5) < U\X\. 

Evaluate the latter inequality at Qs, and subtract the result from the inequality evaluated at Q^, 
to reach 

(F(g.) - F{Qs)) + {FoiQs) - FoiQ.)) < h \X\ . 

Since Qs and Q^, are optimal for their respective problems, both terms in parenthesis above are 
positive, and we deduce that F(){Qs) — Fo(Q>,) < ^6 Since Fq is the objective function or (1.4) 



under the map P = I — Q, the proof is complete. □ 



Appendix D. Proof of Theorem 16.21 



We need to verify that the weak reaper problem (6.1 ) has a solution that satisfies the constraints 
of the reaper problem. The argument is constructive. Given a feasible point P for (6.1 ), we convert 
it into a matrix P' that has a smaller objective value and satisfies the constraint P' ^ I. Since 
the weak reaper problem is a relaxation of the reaper problem, we can convert a solution of the 
weak REAPER problem into a solution to the REAPERproblem. 



Let P be feasible for (6.1), and suppose that P = UAU^ is an eigenvalue decomposition with 
eigenvalues arranged in weakly decreasing order: Ai > • • • > A^:) > 0. We construct a new diagonal 
matrix A' whose diagonal elements satisfy 

(i) < A^ < 1 for each i = l,...,D. 

(ii) Ef=iA^ = trP. 

(iii) (1 - < (1 - Xi? 

Then we set P' = UA'U^. Let us demonstrate that this matrix has the required properties. 

Conditions (i) and (ii) imply that P' is feasible for reaper. We only need to check that the 
objective value of (weak) reaper is the same at P and P'. By unitary invariance, we compute 
that, for each vector x, 

11(1 - P')xf = 11(1 - A'){U'x)f = f](l - X'fiU'xf, < f:(l - X^f{U'x)j = 11(1 - P)xf. 

i=l i=l 

Summing over x ^ X, we see that the objective value of (weak) reaper at P' does not exceed the 
value at P. 

We construct the desired matrix A' by a water- filling argument. Let ii, be the smallest integer 
such that J2]*=i Xj < i*- This number is well defined because J2f^i Xj = d and < d < D. 
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Furthermore, since the eigenvalues are listed in decreasing order, 

1 

Aj < — ^ Aj < 1 for i > 

Set 



A^ = <( 1 - + a 

Ai, 



'J' 



i = u 



(D.l) 













1 - + A,- 


+ 


E A, 


«=1 









We just need to verify that A^ satisfy Conditions (i)-(iii). 
Condition (i) is evident for i ^ i-^. When i = i-^, 

i* 

Xl = 1 - + E Ai < 1 (D.2) 

because the sum is at most z^. Furthermore, since is the smallest integer with the specified 
property, we have 

Al = Ai. + E Aj - (i* - 1) > Ai, > 0. (D.3) 
i=i 

The last inequality holds because Aj > for each i. 

For Condition (ii), we split the sum of A^ into three pieces according to the cases in (D.l), and 
we compute 

EA,=trP. 

i=i 

Condition (iii) clearly holds for each i ^ i^,. For i = i^,, observe that < Aj^ < A^ < 1 because 
of (D.2) and (D.3). The claim follows. 
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